Project Name - Retail Sales Prediction: Predicting sales of a major store chain, Rossmann (Supervised Regression)¶

Project Summary -¶

The Rossmann sales prediction data contains historical sales data for a retail store chain. It includes store-level information such as competitor details, holidays, the number of customers, the date of each sales transaction, and the sales amount on each day. We were tasked with forecasting the "Sales" column for the test set.

After understanding the data and its variables, we first gathered and cleaned the data and handled the null values. To get better results, we merged the two datasets and type-cast the needed features into the required formats so they could be visualized properly. We performed in-depth EDA and plotted different types of graphs, separated into univariate, bivariate and multivariate categories; as a result, we came across some meaningful insights that guided our decisions for the ML model pipeline. Through feature engineering and data preprocessing, we then extracted new features such as PromoDuration and CompetitionDuration from features that do not directly impact Sales. We also reduced multicollinearity among the independent variables with the help of the variance inflation factor (VIF). As part of feature engineering, we detected outliers using the IQR technique and capped the outliers of continuous features to the 25th-75th percentile range. Finally, we noticed that some features were categorical in nature, and an ML model cannot interpret strings directly, so we encoded them as numbers using the One-Hot Encoding technique, as they were unordered in nature.
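The IQR-style capping and One-Hot Encoding steps mentioned above can be sketched as follows. This is a minimal illustration on toy values standing in for the merged data, not the project's actual pipeline:

```python
import pandas as pd

# Toy frame standing in for the merged Rossmann data (illustrative values)
df = pd.DataFrame({
    "CompetitionDistance": [50, 570, 1270, 14130, 75860],
    "StoreType": ["c", "a", "a", "c", "a"],
})

# Cap a continuous feature to its 25th-75th percentile band, as described above
q1 = df["CompetitionDistance"].quantile(0.25)
q3 = df["CompetitionDistance"].quantile(0.75)
df["CompetitionDistance"] = df["CompetitionDistance"].clip(lower=q1, upper=q3)

# One-Hot Encoding for an unordered categorical feature
df = pd.get_dummies(df, columns=["StoreType"], prefix="StoreType")
print(df.columns.tolist())
```

`pd.get_dummies` creates one indicator column per category level, which is appropriate here because StoreType has no natural ordering.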

In order to get normally distributed data, we applied various transformation techniques such as logarithmic, exponential and square-root transformations, among others, and plotted quantile-quantile (Q-Q) plots to visualize how far our data points deviate from a normal distribution. To scale the data, we used scikit-learn's StandardScaler.
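A minimal sketch of the transform, Q-Q check and scaling described above, using synthetic right-skewed data in place of the real Sales column:

```python
import numpy as np
from scipy import stats
from sklearn.preprocessing import StandardScaler

# Synthetic skewed values standing in for Sales (illustrative only)
rng = np.random.default_rng(0)
sales = rng.lognormal(mean=8.6, sigma=0.5, size=1000)

# Logarithmic transformation to pull the right tail in
log_sales = np.log1p(sales)

# Q-Q check: probplot returns (theoretical, ordered) quantile pairs plus a
# line fit; an r close to 1 means the data tracks the normal quantiles well
(osm, osr), (slope, intercept, r) = stats.probplot(log_sales, dist="norm")

# Standardize to zero mean / unit variance with StandardScaler
scaled = StandardScaler().fit_transform(log_sales.reshape(-1, 1))
print(round(scaled.mean(), 6), round(scaled.std(), 6), round(r, 3))
```

Passing an axes object to `stats.probplot(..., plot=plt.gca())` would draw the Q-Q plot itself; here only the fit statistics are used.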

With our final features ready, we split the data into training and testing sets. Next, we chose several machine learning algorithms and trained each model on the training data. Finally, we evaluated each model's performance on the testing data to see how well it predicts sales on unseen data. For this task we used linear regression, decision trees, random forests, LightGBM and XGBoost. With the less complex linear regression models, we achieved an R² score of 0.75 and an accuracy (i.e. 100 − MAPE) of 93%, even after applying regularisation techniques such as Lasso, Ridge and Elastic Net. To capture more variance and fit the data more aptly, we moved on to more complex models one by one. After training decision trees, random forests, LightGBM and XGBoost, we obtained an R² score of 0.94 with the best accuracy of 97% using XGBoost, with a mean absolute percentage error of only 2%. The mean of the residuals was also 0.0, which indicates unbiased predictions and is one characteristic of a good residual plot. Based on these experiments, we chose XGBoost as our final optimal model among all five for deployment, as it delivers the highest accuracy with the least error.
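The train-and-evaluate loop described above can be sketched as follows. The feature matrix here is synthetic, and only linear regression and a decision tree are shown, so the scores will not match the project's reported numbers:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score, mean_absolute_percentage_error

# Synthetic stand-in for the engineered feature matrix (illustrative only)
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 5))
y = 5000 + 800 * X[:, 0] - 300 * X[:, 1] + rng.normal(scale=100, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

results = {}
for model in (LinearRegression(), DecisionTreeRegressor(random_state=42)):
    pred = model.fit(X_train, y_train).predict(X_test)
    r2 = r2_score(y_test, pred)
    # "accuracy" in the summary above is defined as 100 - MAPE (in percent)
    accuracy = 100 - mean_absolute_percentage_error(y_test, pred) * 100
    results[type(model).__name__] = (round(r2, 3), round(accuracy, 2))
print(results)
```

The same loop extends naturally to `RandomForestRegressor`, `LGBMRegressor` and `XGBRegressor`, since all follow the scikit-learn `fit`/`predict` interface.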

Overall, while building a machine learning model on the Rossmann sales prediction data, we applied a combination of data processing, machine learning techniques and model evaluation skills. It was a challenging task and we faced some failures along the way, but with the right approach and knowledge we successfully created a model that can accurately predict sales up to six weeks in advance!

Problem Statement¶

Rossmann operates over 3,000 drug stores in 7 European countries. Currently, Rossmann store managers are tasked with predicting their daily sales for up to six weeks in advance. Store sales are influenced by many factors, including promotions, competition, school and state holidays, seasonality, and locality. With thousands of individual managers predicting sales based on their unique circumstances, the accuracy of results can be quite varied. You are provided with historical sales data for 1,115 Rossmann stores. The task is to forecast the "Sales" column for the test set. Note that some stores in the dataset were temporarily closed for refurbishment.

Rossmann Stores Data.csv - historical data including Sales

Store.csv - supplemental information about the stores

Data fields

Most of the fields are self-explanatory. The following are descriptions for those that aren't.

Id - an Id that represents a (Store, Date) duple within the test set

Store - a unique Id for each store

Sales - the turnover for any given day (this is what you are predicting)

Customers - the number of customers on a given day

Open - an indicator for whether the store was open: 0 = closed, 1 = open

StateHoliday - indicates a state holiday.

SchoolHoliday - indicates if the (Store, Date) was affected by the closure of public schools

StoreType - differentiates between 4 different store models: a, b, c, d

Assortment - describes an assortment level: a = basic, b = extra, c = extended

CompetitionDistance - distance in meters to the nearest competitor store

CompetitionOpenSince[Month/Year] - gives the approximate year and month of the time the nearest competitor was opened

Promo - indicates whether a store is running a promo on that day

Promo2 - Promo2 is a continuing and consecutive promotion for some stores: 0 = store is not participating, 1 = store is participating

Promo2Since[Year/Week] - describes the year and calendar week when the store started participating in Promo2

PromoInterval - describes the consecutive intervals in which Promo2 starts anew, naming the months in which the promotion begins. E.g. "Feb,May,Aug,Nov" means each round starts in February, May, August and November of any given year for that store.
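The PromoInterval semantics can be illustrated with a small hypothetical helper (`promo2_round_starts` is not part of the dataset or this notebook; note also that the raw data's month abbreviations may differ slightly, e.g. "Sept" for September):

```python
# Month abbreviations assumed here; adjust to match the dataset's own spelling
MONTH_ABBR = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
              "Jul", "Aug", "Sept", "Oct", "Nov", "Dec"]

def promo2_round_starts(promo_interval, month_number):
    """Return True if a new Promo2 round starts in the given month (1-12)."""
    if not promo_interval:  # store not participating (PromoInterval is empty/NaN)
        return False
    return MONTH_ABBR[month_number - 1] in promo_interval.split(",")

print(promo2_round_starts("Feb,May,Aug,Nov", 2))  # True  (February)
print(promo2_round_starts("Feb,May,Aug,Nov", 3))  # False (March)
```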

Let's Begin !¶

1. Know Your Data¶

Import Libraries¶

In [2]:
# Importing basic libraries for data processing
import numpy as np
import pandas as pd
import math
from datetime import datetime

# Importing libraries for data visualization
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

# Importing missingno library which helps us to visualize the missing values
import missingno as msno

# Adding this to ignore future warnings
import warnings
warnings.filterwarnings("ignore")

# Importing Pearson correlation function for hypothesis testing
from scipy.stats import pearsonr

# Importing libraries for model implementation
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.linear_model import ElasticNet
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor 
from lightgbm import LGBMRegressor
from xgboost import XGBRegressor

# Importing libraries for hyperparameter tuning
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV

# Importing libraries for visualizing the decision tree
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn import tree
from IPython.display import SVG
from graphviz import Source
from IPython.display import display

# Importing regression metrics to check the model performance
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
from sklearn.metrics import mean_absolute_percentage_error

# Loading the autotime extension to show each cell's execution time in milliseconds
%load_ext autotime

# Importing library for saving the model as a pickle file
import pickle

# Importing SHAP for model explainability
import shap
time: 0 ns (started: 2024-09-07 17:51:18 +00:00)

Dataset Loading¶

In [3]:
# Load Dataset from local directory location in jupyter

df_rossmann= pd.read_csv("Rossmann.csv")
df_store= pd.read_csv("store.csv")
time: 1.88 s (started: 2024-09-07 17:51:18 +00:00)

Dataset First View¶

In [4]:
# Dataset First Look of rossmann dataset
df_rossmann.head()
Out[4]:
Store DayOfWeek Date Sales Customers Open Promo StateHoliday SchoolHoliday
0 1 5 2015-07-31 5263 555 1 1 0 1
1 2 5 2015-07-31 6064 625 1 1 0 1
2 3 5 2015-07-31 8314 821 1 1 0 1
3 4 5 2015-07-31 13995 1498 1 1 0 1
4 5 5 2015-07-31 4822 559 1 1 0 1
time: 500 ms (started: 2024-09-07 17:51:20 +00:00)
In [5]:
# Dataset First Look of store dataset
df_store.head()
Out[5]:
Store StoreType Assortment CompetitionDistance CompetitionOpenSinceMonth CompetitionOpenSinceYear Promo2 Promo2SinceWeek Promo2SinceYear PromoInterval
0 1 c a 1270.0 9.0 2008.0 0 NaN NaN NaN
1 2 a a 570.0 11.0 2007.0 1 13.0 2010.0 Jan,Apr,Jul,Oct
2 3 a a 14130.0 12.0 2006.0 1 14.0 2011.0 Jan,Apr,Jul,Oct
3 4 c c 620.0 9.0 2009.0 0 NaN NaN NaN
4 5 a a 29910.0 4.0 2015.0 0 NaN NaN NaN
time: 16 ms (started: 2024-09-07 17:51:21 +00:00)

Dataset Rows & Columns count¶

In [6]:
# Dataset Rows & Columns count
print(f"Rossmann dataset: Rows={df_rossmann.shape[0]}, Columns={df_rossmann.shape[1]}") 
print(f"Store dataset: Rows={df_store.shape[0]}, Columns={df_store.shape[1]}") 
Rossmann dataset: Rows=1017209, Columns=9
Store dataset: Rows=1115, Columns=10
time: 703 ms (started: 2024-09-07 17:51:21 +00:00)

Dataset Information¶

In [7]:
# Rossmann Dataset Info
df_rossmann.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1017209 entries, 0 to 1017208
Data columns (total 9 columns):
 #   Column         Non-Null Count    Dtype 
---  ------         --------------    ----- 
 0   Store          1017209 non-null  int64 
 1   DayOfWeek      1017209 non-null  int64 
 2   Date           1017209 non-null  object
 3   Sales          1017209 non-null  int64 
 4   Customers      1017209 non-null  int64 
 5   Open           1017209 non-null  int64 
 6   Promo          1017209 non-null  int64 
 7   StateHoliday   1017209 non-null  object
 8   SchoolHoliday  1017209 non-null  int64 
dtypes: int64(7), object(2)
memory usage: 69.8+ MB
time: 1.58 s (started: 2024-09-07 17:51:21 +00:00)
In [8]:
# Store Dataset Info
df_store.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1115 entries, 0 to 1114
Data columns (total 10 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Store                      1115 non-null   int64  
 1   StoreType                  1115 non-null   object 
 2   Assortment                 1115 non-null   object 
 3   CompetitionDistance        1112 non-null   float64
 4   CompetitionOpenSinceMonth  761 non-null    float64
 5   CompetitionOpenSinceYear   761 non-null    float64
 6   Promo2                     1115 non-null   int64  
 7   Promo2SinceWeek            571 non-null    float64
 8   Promo2SinceYear            571 non-null    float64
 9   PromoInterval              571 non-null    object 
dtypes: float64(5), int64(2), object(3)
memory usage: 87.2+ KB
time: 15 ms (started: 2024-09-07 17:51:23 +00:00)

Duplicate Values¶

In [9]:
# Dataset Duplicate Value Count
print(f"Number of duplicated rows in Rossmann dataset: {df_rossmann.duplicated().sum()}")
print(f"Number of duplicated rows in Store dataset: {df_store.duplicated().sum()}")
Number of duplicated rows in Rossmann dataset: 0
Number of duplicated rows in Store dataset: 0
time: 1.11 s (started: 2024-09-07 17:51:23 +00:00)

We do not have any duplicated rows in either dataset, which is very good for us.

Missing Values/Null Values¶

In [10]:
# Missing Values/Null Values Count
print("-"*50)
print(f"Null values count in Rossmann dataset:\n{df_rossmann.isna().sum()}\n")
print("-"*50)
print(f"Null values count in Store dataset:\n{df_store.isna().sum()}")
print("-"*50)
print(f"Infinite values count in Rossmann dataset:\n{df_rossmann.isin([np.inf, -np.inf]).sum()}\n")
print("-"*50)
print(f"Infinite values count in Store dataset:\n{df_store.isin([np.inf, -np.inf]).sum()}")
print("-"*50)
--------------------------------------------------
Null values count in Rossmann dataset:
Store            0
DayOfWeek        0
Date             0
Sales            0
Customers        0
Open             0
Promo            0
StateHoliday     0
SchoolHoliday    0
dtype: int64

--------------------------------------------------
Null values count in Store dataset:
Store                          0
StoreType                      0
Assortment                     0
CompetitionDistance            3
CompetitionOpenSinceMonth    354
CompetitionOpenSinceYear     354
Promo2                         0
Promo2SinceWeek              544
Promo2SinceYear              544
PromoInterval                544
dtype: int64
--------------------------------------------------
Infinite values count in Rossmann dataset:
Store            0
DayOfWeek        0
Date             0
Sales            0
Customers        0
Open             0
Promo            0
StateHoliday     0
SchoolHoliday    0
dtype: int64

--------------------------------------------------
Infinite values count in Store dataset:
Store                        0
StoreType                    0
Assortment                   0
CompetitionDistance          0
CompetitionOpenSinceMonth    0
CompetitionOpenSinceYear     0
Promo2                       0
Promo2SinceWeek              0
Promo2SinceYear              0
PromoInterval                0
dtype: int64
--------------------------------------------------
time: 1.02 s (started: 2024-09-07 17:51:24 +00:00)

We don't have null or infinite values in the Rossmann dataset, but the Store dataset has some null values, which we will have to deal with later.
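One possible way to handle these Store-dataset nulls later (an assumption about the approach, sketched on a miniature stand-in frame): impute CompetitionDistance with the median, and fill the Promo2-related fields with sentinels, since those are missing exactly when a store does not participate in Promo2:

```python
import numpy as np
import pandas as pd

# Miniature stand-in for df_store (illustrative rows only)
store = pd.DataFrame({
    "CompetitionDistance": [1270.0, np.nan, 570.0],
    "Promo2": [0, 1, 0],
    "Promo2SinceWeek": [np.nan, 13.0, np.nan],
    "Promo2SinceYear": [np.nan, 2010.0, np.nan],
    "PromoInterval": [np.nan, "Jan,Apr,Jul,Oct", np.nan],
})

# Distance: impute with the median (robust to the long right tail)
store["CompetitionDistance"] = store["CompetitionDistance"].fillna(
    store["CompetitionDistance"].median())

# Promo2 details are missing exactly when Promo2 == 0 (not participating),
# so a sentinel of 0 / empty string is a reasonable fill
store[["Promo2SinceWeek", "Promo2SinceYear"]] = (
    store[["Promo2SinceWeek", "Promo2SinceYear"]].fillna(0))
store["PromoInterval"] = store["PromoInterval"].fillna("")

print(store.isna().sum().sum())
```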

In [11]:
# Visualizing the missing values in Rossmann dataset
msno.bar(df_rossmann,figsize=(10,5), color="tab:blue")
Out[11]:
<Axes: >
time: 1.41 s (started: 2024-09-07 17:51:25 +00:00)
In [12]:
# Visualizing the missing values in Store dataset
msno.bar(df_store, figsize=(10, 4), color="orange")

# Add title to the plot
plt.title("Missing Data Overview for df_store", fontsize=16)

# Show the plot
plt.show()

What did you know about your dataset?¶

So far, we have learned the following about our dataset:

  1. The 'Rossmann' dataset has 1,017,209 rows and 9 columns and contains no null values.
  2. The 'Store' dataset has 1,115 rows and 10 columns. It contains null values in six features, viz. CompetitionDistance, CompetitionOpenSinceMonth, CompetitionOpenSinceYear, Promo2SinceWeek, Promo2SinceYear and PromoInterval.
  3. There are no duplicate rows in either dataset.
  4. The 'Rossmann' dataset has four categorical features, namely Open, Promo, StateHoliday and SchoolHoliday, while the 'Store' dataset has the categorical features StoreType, Assortment, Promo2 and PromoInterval.
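The merge of the two datasets mentioned in the summary can be sketched as a left join on Store, which keeps every daily sales row and attaches its store attributes. The frames below are tiny stand-ins for the ones loaded above:

```python
import pandas as pd

# Tiny stand-ins for df_rossmann and df_store (illustrative rows only)
df_rossmann = pd.DataFrame({
    "Store": [1, 2, 1],
    "Date": ["2015-07-31", "2015-07-31", "2015-07-30"],
    "Sales": [5263, 6064, 5020],
})
df_store = pd.DataFrame({
    "Store": [1, 2],
    "StoreType": ["c", "a"],
    "CompetitionDistance": [1270.0, 570.0],
})

# Left merge on the shared Store key
df = df_rossmann.merge(df_store, on="Store", how="left")
print(df.shape)  # (3, 5)
```

A left merge is the safe choice here: every (Store, Date) sales record survives even if a store were missing from the supplemental file.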

2. Understanding Your Variables¶

In [13]:
# Dataset Columns
print(f"Column names of the Rossmann dataset are:\n{df_rossmann.columns.tolist()}")
print(f"Column names of the Store dataset are:\n{df_store.columns.tolist()}")
Column names of the Rossmann dataset are:
['Store', 'DayOfWeek', 'Date', 'Sales', 'Customers', 'Open', 'Promo', 'StateHoliday', 'SchoolHoliday']
Column names of the Store dataset are:
['Store', 'StoreType', 'Assortment', 'CompetitionDistance', 'CompetitionOpenSinceMonth', 'CompetitionOpenSinceYear', 'Promo2', 'Promo2SinceWeek', 'Promo2SinceYear', 'PromoInterval']
time: 0 ns (started: 2024-09-07 17:51:27 +00:00)
In [14]:
#Rossmann Dataset Describe
df_rossmann.describe(include="all")
Out[14]:
Store DayOfWeek Date Sales Customers Open Promo StateHoliday SchoolHoliday
count 1.017209e+06 1.017209e+06 1017209 1.017209e+06 1.017209e+06 1.017209e+06 1.017209e+06 1017209 1.017209e+06
unique NaN NaN 942 NaN NaN NaN NaN 5 NaN
top NaN NaN 2013-01-02 NaN NaN NaN NaN 0 NaN
freq NaN NaN 1115 NaN NaN NaN NaN 855087 NaN
mean 5.584297e+02 3.998341e+00 NaN 5.773819e+03 6.331459e+02 8.301067e-01 3.815145e-01 NaN 1.786467e-01
std 3.219087e+02 1.997391e+00 NaN 3.849926e+03 4.644117e+02 3.755392e-01 4.857586e-01 NaN 3.830564e-01
min 1.000000e+00 1.000000e+00 NaN 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 NaN 0.000000e+00
25% 2.800000e+02 2.000000e+00 NaN 3.727000e+03 4.050000e+02 1.000000e+00 0.000000e+00 NaN 0.000000e+00
50% 5.580000e+02 4.000000e+00 NaN 5.744000e+03 6.090000e+02 1.000000e+00 0.000000e+00 NaN 0.000000e+00
75% 8.380000e+02 6.000000e+00 NaN 7.856000e+03 8.370000e+02 1.000000e+00 1.000000e+00 NaN 0.000000e+00
max 1.115000e+03 7.000000e+00 NaN 4.155100e+04 7.388000e+03 1.000000e+00 1.000000e+00 NaN 1.000000e+00
time: 860 ms (started: 2024-09-07 17:51:27 +00:00)
In [15]:
# Store dataset describe
df_store.describe(include="all")
Out[15]:
Store StoreType Assortment CompetitionDistance CompetitionOpenSinceMonth CompetitionOpenSinceYear Promo2 Promo2SinceWeek Promo2SinceYear PromoInterval
count 1115.00000 1115 1115 1112.000000 761.000000 761.000000 1115.000000 571.000000 571.000000 571
unique NaN 4 3 NaN NaN NaN NaN NaN NaN 3
top NaN a a NaN NaN NaN NaN NaN NaN Jan,Apr,Jul,Oct
freq NaN 602 593 NaN NaN NaN NaN NaN NaN 335
mean 558.00000 NaN NaN 5404.901079 7.224704 2008.668857 0.512108 23.595447 2011.763573 NaN
std 322.01708 NaN NaN 7663.174720 3.212348 6.195983 0.500078 14.141984 1.674935 NaN
min 1.00000 NaN NaN 20.000000 1.000000 1900.000000 0.000000 1.000000 2009.000000 NaN
25% 279.50000 NaN NaN 717.500000 4.000000 2006.000000 0.000000 13.000000 2011.000000 NaN
50% 558.00000 NaN NaN 2325.000000 8.000000 2010.000000 1.000000 22.000000 2012.000000 NaN
75% 836.50000 NaN NaN 6882.500000 10.000000 2013.000000 1.000000 37.000000 2013.000000 NaN
max 1115.00000 NaN NaN 75860.000000 12.000000 2015.000000 1.000000 50.000000 2015.000000 NaN
time: 16 ms (started: 2024-09-07 17:51:28 +00:00)

Variables Description / Dataset Column Description¶

These are the same fields described under "Data fields" in the Problem Statement section above; see that list for the full descriptions.

Check Unique Values for each variable.¶

In [16]:
# Check Unique Values for each variable.
for i in df_rossmann.columns.tolist():
  print("The Unique Values of', i, 'are:", df_rossmann[i].unique())
The Unique Values of', i, 'are: [   1    2    3 ... 1113 1114 1115]
The Unique Values of', i, 'are: [5 4 3 2 1 7 6]
The Unique Values of', i, 'are: ['2015-07-31' '2015-07-30' '2015-07-29' '2015-07-28' '2015-07-27'
 '2015-07-26' '2015-07-25' '2015-07-24' '2015-07-23' '2015-07-22'
 '2015-07-21' '2015-07-20' '2015-07-19' '2015-07-18' '2015-07-17'
 '2015-07-16' '2015-07-15' '2015-07-14' '2015-07-13' '2015-07-12'
 '2015-07-11' '2015-07-10' '2015-07-09' '2015-07-08' '2015-07-07'
 '2015-07-06' '2015-07-05' '2015-07-04' '2015-07-03' '2015-07-02'
 '2015-07-01' '2015-06-30' '2015-06-29' '2015-06-28' '2015-06-27'
 '2015-06-26' '2015-06-25' '2015-06-24' '2015-06-23' '2015-06-22'
 '2015-06-21' '2015-06-20' '2015-06-19' '2015-06-18' '2015-06-17'
 '2015-06-16' '2015-06-15' '2015-06-14' '2015-06-13' '2015-06-12'
 '2015-06-11' '2015-06-10' '2015-06-09' '2015-06-08' '2015-06-07'
 '2015-06-06' '2015-06-05' '2015-06-04' '2015-06-03' '2015-06-02'
 '2015-06-01' '2015-05-31' '2015-05-30' '2015-05-29' '2015-05-28'
 '2015-05-27' '2015-05-26' '2015-05-25' '2015-05-24' '2015-05-23'
 '2015-05-22' '2015-05-21' '2015-05-20' '2015-05-19' '2015-05-18'
 '2015-05-17' '2015-05-16' '2015-05-15' '2015-05-14' '2015-05-13'
 '2015-05-12' '2015-05-11' '2015-05-10' '2015-05-09' '2015-05-08'
 '2015-05-07' '2015-05-06' '2015-05-05' '2015-05-04' '2015-05-03'
 '2015-05-02' '2015-05-01' '2015-04-30' '2015-04-29' '2015-04-28'
 '2015-04-27' '2015-04-26' '2015-04-25' '2015-04-24' '2015-04-23'
 '2015-04-22' '2015-04-21' '2015-04-20' '2015-04-19' '2015-04-18'
 '2015-04-17' '2015-04-16' '2015-04-15' '2015-04-14' '2015-04-13'
 '2015-04-12' '2015-04-11' '2015-04-10' '2015-04-09' '2015-04-08'
 '2015-04-07' '2015-04-06' '2015-04-05' '2015-04-04' '2015-04-03'
 '2015-04-02' '2015-04-01' '2015-03-31' '2015-03-30' '2015-03-29'
 '2015-03-28' '2015-03-27' '2015-03-26' '2015-03-25' '2015-03-24'
 '2015-03-23' '2015-03-22' '2015-03-21' '2015-03-20' '2015-03-19'
 '2015-03-18' '2015-03-17' '2015-03-16' '2015-03-15' '2015-03-14'
 '2015-03-13' '2015-03-12' '2015-03-11' '2015-03-10' '2015-03-09'
 '2015-03-08' '2015-03-07' '2015-03-06' '2015-03-05' '2015-03-04'
 '2015-03-03' '2015-03-02' '2015-03-01' '2015-02-28' '2015-02-27'
 '2015-02-26' '2015-02-25' '2015-02-24' '2015-02-23' '2015-02-22'
 '2015-02-21' '2015-02-20' '2015-02-19' '2015-02-18' '2015-02-17'
 '2015-02-16' '2015-02-15' '2015-02-14' '2015-02-13' '2015-02-12'
 '2015-02-11' '2015-02-10' '2015-02-09' '2015-02-08' '2015-02-07'
 '2015-02-06' '2015-02-05' '2015-02-04' '2015-02-03' '2015-02-02'
 '2015-02-01' '2015-01-31' '2015-01-30' '2015-01-29' '2015-01-28'
 '2015-01-27' '2015-01-26' '2015-01-25' '2015-01-24' '2015-01-23'
 '2015-01-22' '2015-01-21' '2015-01-20' '2015-01-19' '2015-01-18'
 '2015-01-17' '2015-01-16' '2015-01-15' '2015-01-14' '2015-01-13'
 '2015-01-12' '2015-01-11' '2015-01-10' '2015-01-09' '2015-01-08'
 '2015-01-07' '2015-01-06' '2015-01-05' '2015-01-04' '2015-01-03'
 '2015-01-02' '2015-01-01' '2014-12-31' '2014-12-30' '2014-12-29'
 '2014-12-28' '2014-12-27' '2014-12-26' '2014-12-25' '2014-12-24'
 '2014-12-23' '2014-12-22' '2014-12-21' '2014-12-20' '2014-12-19'
 '2014-12-18' '2014-12-17' '2014-12-16' '2014-12-15' '2014-12-14'
 '2014-12-13' '2014-12-12' '2014-12-11' '2014-12-10' '2014-12-09'
 '2014-12-08' '2014-12-07' '2014-12-06' '2014-12-05' '2014-12-04'
 '2014-12-03' '2014-12-02' '2014-12-01' '2014-11-30' '2014-11-29'
 '2014-11-28' '2014-11-27' '2014-11-26' '2014-11-25' '2014-11-24'
 '2014-11-23' '2014-11-22' '2014-11-21' '2014-11-20' '2014-11-19'
 '2014-11-18' '2014-11-17' '2014-11-16' '2014-11-15' '2014-11-14'
 '2014-11-13' '2014-11-12' '2014-11-11' '2014-11-10' '2014-11-09'
 '2014-11-08' '2014-11-07' '2014-11-06' '2014-11-05' '2014-11-04'
 '2014-11-03' '2014-11-02' '2014-11-01' '2014-10-31' '2014-10-30'
 '2014-10-29' '2014-10-28' '2014-10-27' '2014-10-26' '2014-10-25'
 '2014-10-24' '2014-10-23' '2014-10-22' '2014-10-21' '2014-10-20'
 '2014-10-19' '2014-10-18' '2014-10-17' '2014-10-16' '2014-10-15'
 '2014-10-14' '2014-10-13' '2014-10-12' '2014-10-11' '2014-10-10'
 '2014-10-09' '2014-10-08' '2014-10-07' '2014-10-06' '2014-10-05'
 '2014-10-04' '2014-10-03' '2014-10-02' '2014-10-01' '2014-09-30'
 '2014-09-29' '2014-09-28' '2014-09-27' '2014-09-26' '2014-09-25'
 '2014-09-24' '2014-09-23' '2014-09-22' '2014-09-21' '2014-09-20'
 '2014-09-19' '2014-09-18' '2014-09-17' '2014-09-16' '2014-09-15'
 '2014-09-14' '2014-09-13' '2014-09-12' '2014-09-11' '2014-09-10'
 '2014-09-09' '2014-09-08' '2014-09-07' '2014-09-06' '2014-09-05'
 '2014-09-04' '2014-09-03' '2014-09-02' '2014-09-01' '2014-08-31'
 '2014-08-30' '2014-08-29' '2014-08-28' '2014-08-27' '2014-08-26'
 '2014-08-25' '2014-08-24' '2014-08-23' '2014-08-22' '2014-08-21'
 '2014-08-20' '2014-08-19' '2014-08-18' '2014-08-17' '2014-08-16'
 '2014-08-15' '2014-08-14' '2014-08-13' '2014-08-12' '2014-08-11'
 '2014-08-10' '2014-08-09' '2014-08-08' '2014-08-07' '2014-08-06'
 '2014-08-05' '2014-08-04' '2014-08-03' '2014-08-02' '2014-08-01'
 '2014-07-31' '2014-07-30' '2014-07-29' '2014-07-28' '2014-07-27'
 '2014-07-26' '2014-07-25' '2014-07-24' '2014-07-23' '2014-07-22'
 '2014-07-21' '2014-07-20' '2014-07-19' '2014-07-18' '2014-07-17'
 '2014-07-16' '2014-07-15' '2014-07-14' '2014-07-13' '2014-07-12'
 '2014-07-11' '2014-07-10' '2014-07-09' '2014-07-08' '2014-07-07'
 '2014-07-06' '2014-07-05' '2014-07-04' '2014-07-03' '2014-07-02'
 '2014-07-01' '2014-06-30' '2014-06-29' '2014-06-28' '2014-06-27'
 '2014-06-26' '2014-06-25' '2014-06-24' '2014-06-23' '2014-06-22'
 '2014-06-21' '2014-06-20' '2014-06-19' '2014-06-18' '2014-06-17'
 '2014-06-16' '2014-06-15' '2014-06-14' '2014-06-13' '2014-06-12'
 '2014-06-11' '2014-06-10' '2014-06-09' '2014-06-08' '2014-06-07'
 '2014-06-06' '2014-06-05' '2014-06-04' '2014-06-03' '2014-06-02'
 '2014-06-01' '2014-05-31' '2014-05-30' '2014-05-29' '2014-05-28'
 '2014-05-27' '2014-05-26' '2014-05-25' '2014-05-24' '2014-05-23'
 '2014-05-22' '2014-05-21' '2014-05-20' '2014-05-19' '2014-05-18'
 '2014-05-17' '2014-05-16' '2014-05-15' '2014-05-14' '2014-05-13'
 '2014-05-12' '2014-05-11' '2014-05-10' '2014-05-09' '2014-05-08'
 '2014-05-07' '2014-05-06' '2014-05-05' '2014-05-04' '2014-05-03'
 '2014-05-02' '2014-05-01' '2014-04-30' '2014-04-29' '2014-04-28'
 '2014-04-27' '2014-04-26' '2014-04-25' '2014-04-24' '2014-04-23'
 '2014-04-22' '2014-04-21' '2014-04-20' '2014-04-19' '2014-04-18'
 '2014-04-17' '2014-04-16' '2014-04-15' '2014-04-14' '2014-04-13'
 '2014-04-12' '2014-04-11' '2014-04-10' '2014-04-09' '2014-04-08'
 '2014-04-07' '2014-04-06' '2014-04-05' '2014-04-04' '2014-04-03'
 '2014-04-02' '2014-04-01' '2014-03-31' '2014-03-30' '2014-03-29'
 '2014-03-28' '2014-03-27' '2014-03-26' '2014-03-25' '2014-03-24'
 '2014-03-23' '2014-03-22' '2014-03-21' '2014-03-20' '2014-03-19'
 '2014-03-18' '2014-03-17' '2014-03-16' '2014-03-15' '2014-03-14'
 '2014-03-13' '2014-03-12' '2014-03-11' '2014-03-10' '2014-03-09'
 '2014-03-08' '2014-03-07' '2014-03-06' '2014-03-05' '2014-03-04'
 '2014-03-03' '2014-03-02' '2014-03-01' '2014-02-28' '2014-02-27'
 '2014-02-26' '2014-02-25' '2014-02-24' '2014-02-23' '2014-02-22'
 '2014-02-21' '2014-02-20' '2014-02-19' '2014-02-18' '2014-02-17'
 '2014-02-16' '2014-02-15' '2014-02-14' '2014-02-13' '2014-02-12'
 '2014-02-11' '2014-02-10' '2014-02-09' '2014-02-08' '2014-02-07'
 '2014-02-06' '2014-02-05' '2014-02-04' '2014-02-03' '2014-02-02'
 '2014-02-01' '2014-01-31' '2014-01-30' '2014-01-29' '2014-01-28'
 '2014-01-27' '2014-01-26' '2014-01-25' '2014-01-24' '2014-01-23'
 '2014-01-22' '2014-01-21' '2014-01-20' '2014-01-19' '2014-01-18'
 '2014-01-17' '2014-01-16' '2014-01-15' '2014-01-14' '2014-01-13'
 '2014-01-12' '2014-01-11' '2014-01-10' '2014-01-09' '2014-01-08'
 '2014-01-07' '2014-01-06' '2014-01-05' '2014-01-04' '2014-01-03'
 '2014-01-02' '2014-01-01' '2013-12-31' '2013-12-30' '2013-12-29'
 '2013-12-28' '2013-12-27' '2013-12-26' '2013-12-25' '2013-12-24'
 '2013-12-23' '2013-12-22' '2013-12-21' '2013-12-20' '2013-12-19'
 '2013-12-18' '2013-12-17' '2013-12-16' '2013-12-15' '2013-12-14'
 '2013-12-13' '2013-12-12' '2013-12-11' '2013-12-10' '2013-12-09'
 '2013-12-08' '2013-12-07' '2013-12-06' '2013-12-05' '2013-12-04'
 '2013-12-03' '2013-12-02' '2013-12-01' '2013-11-30' '2013-11-29'
 '2013-11-28' '2013-11-27' '2013-11-26' '2013-11-25' '2013-11-24'
 '2013-11-23' '2013-11-22' '2013-11-21' '2013-11-20' '2013-11-19'
 '2013-11-18' '2013-11-17' '2013-11-16' '2013-11-15' '2013-11-14'
 '2013-11-13' '2013-11-12' '2013-11-11' '2013-11-10' '2013-11-09'
 '2013-11-08' '2013-11-07' '2013-11-06' '2013-11-05' '2013-11-04'
 '2013-11-03' '2013-11-02' '2013-11-01' '2013-10-31' '2013-10-30'
 '2013-10-29' '2013-10-28' '2013-10-27' '2013-10-26' '2013-10-25'
 '2013-10-24' '2013-10-23' '2013-10-22' '2013-10-21' '2013-10-20'
 '2013-10-19' '2013-10-18' '2013-10-17' '2013-10-16' '2013-10-15'
 '2013-10-14' '2013-10-13' '2013-10-12' '2013-10-11' '2013-10-10'
 '2013-10-09' '2013-10-08' '2013-10-07' '2013-10-06' '2013-10-05'
 '2013-10-04' '2013-10-03' '2013-10-02' '2013-10-01' '2013-09-30'
 '2013-09-29' '2013-09-28' '2013-09-27' '2013-09-26' '2013-09-25'
 '2013-09-24' '2013-09-23' '2013-09-22' '2013-09-21' '2013-09-20'
 '2013-09-19' '2013-09-18' '2013-09-17' '2013-09-16' '2013-09-15'
 '2013-09-14' '2013-09-13' '2013-09-12' '2013-09-11' '2013-09-10'
 '2013-09-09' '2013-09-08' '2013-09-07' '2013-09-06' '2013-09-05'
 '2013-09-04' '2013-09-03' '2013-09-02' '2013-09-01' '2013-08-31'
 '2013-08-30' '2013-08-29' '2013-08-28' '2013-08-27' '2013-08-26'
 '2013-08-25' '2013-08-24' '2013-08-23' '2013-08-22' '2013-08-21'
 '2013-08-20' '2013-08-19' '2013-08-18' '2013-08-17' '2013-08-16'
 '2013-08-15' '2013-08-14' '2013-08-13' '2013-08-12' '2013-08-11'
 '2013-08-10' '2013-08-09' '2013-08-08' '2013-08-07' '2013-08-06'
 '2013-08-05' '2013-08-04' '2013-08-03' '2013-08-02' '2013-08-01'
 '2013-07-31' '2013-07-30' '2013-07-29' '2013-07-28' '2013-07-27'
 '2013-07-26' '2013-07-25' '2013-07-24' '2013-07-23' '2013-07-22'
 '2013-07-21' '2013-07-20' '2013-07-19' '2013-07-18' '2013-07-17'
 ... '2013-01-02' '2013-01-01']
The Unique Values of Sales are: [ 5263  6064  8314 ...   660 17815 23303]
The Unique Values of Customers are: [ 555  625  821 ... 3900   36 4065]
The Unique Values of Open are: [1 0]
The Unique Values of Promo are: [1 0]
The Unique Values of StateHoliday are: ['0' 'a' 'b' 'c' 0]
The Unique Values of SchoolHoliday are: [1 0]
time: 437 ms (started: 2024-09-07 17:51:28 +00:00)
In [17]:
# Check unique values for each variable.
for j in df_store.columns.tolist():
  print(f"The Unique Values of {j} are:", df_store[j].unique())
The Unique Values of Store are: [   1    2    3 ... 1113 1114 1115]
The Unique Values of StoreType are: ['c' 'a' 'd' 'b']
The Unique Values of Assortment are: ['a' 'c' 'b']
The Unique Values of CompetitionDistance are: [1.270e+03 5.700e+02 1.413e+04 ... nan ... 1.900e+03 1.880e+03]
The Unique Values of CompetitionOpenSinceMonth are: [ 9. 11. 12.  4. 10.  8. nan  3.  6.  5.  1.  2.  7.]
The Unique Values of CompetitionOpenSinceYear are: [2008. 2007. 2006. 2009. 2015. 2013. 2014. 2000. 2011.   nan 2010. 2005.
 1999. 2003. 2012. 2004. 2002. 1961. 1995. 2001. 1990. 1994. 1900. 1998.]
The Unique Values of Promo2 are: [0 1]
The Unique Values of Promo2SinceWeek are: [nan 13. 14.  1. 45. 40. 26. 22.  5.  6. 10. 31. 37.  9. 39. 27. 18. 35.
 23. 48. 36. 50. 44. 49. 28.]
The Unique Values of Promo2SinceYear are: [  nan 2010. 2011. 2012. 2009. 2014. 2015. 2013.]
The Unique Values of PromoInterval are: [nan 'Jan,Apr,Jul,Oct' 'Feb,May,Aug,Nov' 'Mar,Jun,Sept,Dec']
time: 16 ms (started: 2024-09-07 17:51:29 +00:00)
3. Data Wrangling¶

1. Null values treatment¶

a). Handling null values from CompetitionDistance feature.¶
In [223]:
# Filter only numeric columns from df_store
numeric_cols = df_store.select_dtypes(include=[np.number]).columns
n_cols = 4
n_rows = math.ceil(len(numeric_cols) / n_cols)

fig, axes = plt.subplots(n_rows, n_cols, figsize=(15, 5 * n_rows))
axes = axes.flatten()

# Loop through each numeric column and create a boxplot
for idx, col in enumerate(numeric_cols):
    sns.boxplot(x=df_store[col], ax=axes[idx])
    axes[idx].set_title(f'Boxplot of {col}')
    axes[idx].set_xlabel('')

# Hide any remaining empty subplots
for i in range(len(numeric_cols), len(axes)):
    fig.delaxes(axes[i])

plt.tight_layout()
plt.show()
[Boxplots of the numeric columns in df_store]
time: 1.11 s (started: 2024-09-08 10:07:24 +00:00)

In order to fill the null values of CompetitionDistance we have 4 options:

  1. Zero(0) >> Not effective, as the corresponding CompetitionOpenSinceMonth and CompetitionOpenSinceYear values are not zero.
  2. Mean >> Replacing null values with the mean would be misleading, since the feature contains outliers and the mean is strongly influenced by them.
  3. Median >> Robust to outliers, so replacing with the median is a better choice.
  4. Mode >> The mode could also be used to fill the null values in our case.

Since we are left with two reasonable options (Mode and Median), we are going with the median.
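The difference between these imputation choices can be seen on a tiny example (a hypothetical series, not the actual store data): one extreme outlier drags the mean far above the typical value, while the median stays representative.

```python
import numpy as np
import pandas as pd

# Hypothetical distances with one missing value and one extreme outlier
dist = pd.Series([100.0, 200.0, 300.0, np.nan, 50000.0])

print(dist.mean())    # 12650.0 -- pulled up by the outlier
print(dist.median())  # 250.0   -- robust to the outlier

# Filling with the median keeps the imputed value in the typical range
filled = dist.fillna(dist.median())
print(filled.tolist())  # [100.0, 200.0, 300.0, 250.0, 50000.0]
```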

In [19]:
# filling null values
df_store["CompetitionDistance"].fillna(df_store["CompetitionDistance"].median(), inplace=True)
time: 0 ns (started: 2024-09-07 17:51:30 +00:00)
In [20]:
# rechecking if we have any further null values in CompetitionDistance feature
df_store[df_store["CompetitionDistance"].isnull()].sum()
Out[20]:
Store                          0
StoreType                      0
Assortment                     0
CompetitionDistance          0.0
CompetitionOpenSinceMonth    0.0
CompetitionOpenSinceYear     0.0
Promo2                         0
Promo2SinceWeek              0.0
Promo2SinceYear              0.0
PromoInterval                  0
dtype: object
time: 110 ms (started: 2024-09-07 17:51:30 +00:00)
In [21]:
# rechecking if we have any null values
df_store.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1115 entries, 0 to 1114
Data columns (total 10 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Store                      1115 non-null   int64  
 1   StoreType                  1115 non-null   object 
 2   Assortment                 1115 non-null   object 
 3   CompetitionDistance        1115 non-null   float64
 4   CompetitionOpenSinceMonth  761 non-null    float64
 5   CompetitionOpenSinceYear   761 non-null    float64
 6   Promo2                     1115 non-null   int64  
 7   Promo2SinceWeek            571 non-null    float64
 8   Promo2SinceYear            571 non-null    float64
 9   PromoInterval              571 non-null    object 
dtypes: float64(5), int64(2), object(3)
memory usage: 87.2+ KB
time: 78 ms (started: 2024-09-07 17:51:30 +00:00)

Hurray!! We no longer have any null values in the CompetitionDistance feature.

b). Handling null values from CompetitionOpenSinceMonth and CompetitionOpenSinceYear feature.¶
In [22]:
# fetching the observations which contains null values in CompetitionOpenSinceMonth and CompetitionOpenSinceYear feature
df_store[df_store["CompetitionOpenSinceMonth"].isnull()]
Out[22]:
Store StoreType Assortment CompetitionDistance CompetitionOpenSinceMonth CompetitionOpenSinceYear Promo2 Promo2SinceWeek Promo2SinceYear PromoInterval
11 12 a c 1070.0 NaN NaN 1 13.0 2010.0 Jan,Apr,Jul,Oct
12 13 d a 310.0 NaN NaN 1 45.0 2009.0 Feb,May,Aug,Nov
15 16 a c 3270.0 NaN NaN 0 NaN NaN NaN
18 19 a c 3240.0 NaN NaN 1 22.0 2011.0 Mar,Jun,Sept,Dec
21 22 a a 1040.0 NaN NaN 1 22.0 2012.0 Jan,Apr,Jul,Oct
... ... ... ... ... ... ... ... ... ... ...
1095 1096 a c 1130.0 NaN NaN 1 10.0 2014.0 Mar,Jun,Sept,Dec
1099 1100 a a 540.0 NaN NaN 1 14.0 2011.0 Jan,Apr,Jul,Oct
1112 1113 a c 9260.0 NaN NaN 0 NaN NaN NaN
1113 1114 a c 870.0 NaN NaN 0 NaN NaN NaN
1114 1115 d c 5350.0 NaN NaN 1 22.0 2012.0 Mar,Jun,Sept,Dec

354 rows × 10 columns

time: 78 ms (started: 2024-09-07 17:51:30 +00:00)

Filling null values of CompetitionOpenSinceMonth and CompetitionOpenSinceYear with Mode.

In [23]:
# filling null values of CompetitionOpenSinceMonth
df_store["CompetitionOpenSinceMonth"].fillna(df_store["CompetitionOpenSinceMonth"].mode()[0], inplace=True)
df_store.head()
Out[23]:
Store StoreType Assortment CompetitionDistance CompetitionOpenSinceMonth CompetitionOpenSinceYear Promo2 Promo2SinceWeek Promo2SinceYear PromoInterval
0 1 c a 1270.0 9.0 2008.0 0 NaN NaN NaN
1 2 a a 570.0 11.0 2007.0 1 13.0 2010.0 Jan,Apr,Jul,Oct
2 3 a a 14130.0 12.0 2006.0 1 14.0 2011.0 Jan,Apr,Jul,Oct
3 4 c c 620.0 9.0 2009.0 0 NaN NaN NaN
4 5 a a 29910.0 4.0 2015.0 0 NaN NaN NaN
time: 78 ms (started: 2024-09-07 17:51:30 +00:00)
In [24]:
# filling null values of CompetitionOpenSinceYear
df_store["CompetitionOpenSinceYear"].fillna(df_store["CompetitionOpenSinceYear"].mode()[0], inplace=True)
df_store.tail()
Out[24]:
Store StoreType Assortment CompetitionDistance CompetitionOpenSinceMonth CompetitionOpenSinceYear Promo2 Promo2SinceWeek Promo2SinceYear PromoInterval
1110 1111 a a 1900.0 6.0 2014.0 1 31.0 2013.0 Jan,Apr,Jul,Oct
1111 1112 c c 1880.0 4.0 2006.0 0 NaN NaN NaN
1112 1113 a c 9260.0 9.0 2013.0 0 NaN NaN NaN
1113 1114 a c 870.0 9.0 2013.0 0 NaN NaN NaN
1114 1115 d c 5350.0 9.0 2013.0 1 22.0 2012.0 Mar,Jun,Sept,Dec
time: 109 ms (started: 2024-09-07 17:51:30 +00:00)
In [25]:
# rechecking if we have any null values in CompetitionOpenSinceMonth and CompetitionOpenSinceYear
df_store.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1115 entries, 0 to 1114
Data columns (total 10 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Store                      1115 non-null   int64  
 1   StoreType                  1115 non-null   object 
 2   Assortment                 1115 non-null   object 
 3   CompetitionDistance        1115 non-null   float64
 4   CompetitionOpenSinceMonth  1115 non-null   float64
 5   CompetitionOpenSinceYear   1115 non-null   float64
 6   Promo2                     1115 non-null   int64  
 7   Promo2SinceWeek            571 non-null    float64
 8   Promo2SinceYear            571 non-null    float64
 9   PromoInterval              571 non-null    object 
dtypes: float64(5), int64(2), object(3)
memory usage: 87.2+ KB
time: 94 ms (started: 2024-09-07 17:51:30 +00:00)
c). Handling null values from Promo2SinceWeek, Promo2SinceYear and PromoInterval feature.¶

Wherever Promo2 is 0, the corresponding Promo2SinceWeek, Promo2SinceYear and PromoInterval values are null: those stores never joined the second promotion. So we replace these null values with 0.
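The claim that the nulls occur exactly where Promo2 is 0 can be checked before filling. A minimal sketch on a hypothetical mini table (not the real df_store):

```python
import numpy as np
import pandas as pd

# Hypothetical stores: Promo2SinceWeek is null exactly where Promo2 == 0
stores = pd.DataFrame({
    "Promo2":          [0, 1, 0, 1],
    "Promo2SinceWeek": [np.nan, 13.0, np.nan, 45.0],
})

# The null mask coincides with Promo2 == 0, so 0 is a faithful fill value
assert stores["Promo2SinceWeek"].isnull().equals(stores["Promo2"].eq(0))

stores["Promo2SinceWeek"] = stores["Promo2SinceWeek"].fillna(0)
print(stores["Promo2SinceWeek"].tolist())  # [0.0, 13.0, 0.0, 45.0]
```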

In [26]:
# Replacing all the null values of Promo2SinceWeek, Promo2SinceYear and PromoInterval with 0. 
df_store["Promo2SinceWeek"].fillna(0, inplace=True)
df_store["Promo2SinceYear"].fillna(0, inplace=True)
df_store["PromoInterval"].fillna(0, inplace=True)
df_store.head()
Out[26]:
Store StoreType Assortment CompetitionDistance CompetitionOpenSinceMonth CompetitionOpenSinceYear Promo2 Promo2SinceWeek Promo2SinceYear PromoInterval
0 1 c a 1270.0 9.0 2008.0 0 0.0 0.0 0
1 2 a a 570.0 11.0 2007.0 1 13.0 2010.0 Jan,Apr,Jul,Oct
2 3 a a 14130.0 12.0 2006.0 1 14.0 2011.0 Jan,Apr,Jul,Oct
3 4 c c 620.0 9.0 2009.0 0 0.0 0.0 0
4 5 a a 29910.0 4.0 2015.0 0 0.0 0.0 0
time: 78 ms (started: 2024-09-07 17:51:30 +00:00)
In [27]:
# rechecking if our features contains more null values
df_store.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1115 entries, 0 to 1114
Data columns (total 10 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Store                      1115 non-null   int64  
 1   StoreType                  1115 non-null   object 
 2   Assortment                 1115 non-null   object 
 3   CompetitionDistance        1115 non-null   float64
 4   CompetitionOpenSinceMonth  1115 non-null   float64
 5   CompetitionOpenSinceYear   1115 non-null   float64
 6   Promo2                     1115 non-null   int64  
 7   Promo2SinceWeek            1115 non-null   float64
 8   Promo2SinceYear            1115 non-null   float64
 9   PromoInterval              1115 non-null   object 
dtypes: float64(5), int64(2), object(3)
memory usage: 87.2+ KB
time: 234 ms (started: 2024-09-07 17:51:31 +00:00)

Congratulations, we have achieved our first milestone: all null/missing values have been cleaned up in both datasets.

2. Merging datasets¶

In [28]:
# checking shape of dataset using left join
df_left= pd.merge(df_rossmann, df_store, on="Store", how= "left")
print(df_left.shape)
print(f"Total number of null values obtained from left join: {df_left.isna().sum().sum()}")

# checking shape of dataset using right join
df_right= pd.merge(df_rossmann, df_store, on="Store", how= "right")
print(df_right.shape)
print(f"Total number of null values obtained from right join: {df_right.isna().sum().sum()}")


# checking shape of dataset using inner join
df_inner= pd.merge(df_rossmann, df_store, on="Store", how= "inner")
print(df_inner.shape)
print(f"Total number of null values obtained from inner join: {df_inner.isna().sum().sum()}")

# checking shape of dataset using outer join
df_outer= pd.merge(df_rossmann, df_store, on="Store", how= "outer")
print(df_outer.shape)
print(f"Total number of null values obtained from outer join: {df_outer.isna().sum().sum()}")
(1017209, 18)
Total number of null values obtained from left join: 0
(1017209, 18)
Total number of null values obtained from right join: 0
(1017209, 18)
Total number of null values obtained from inner join: 0
(1017209, 18)
Total number of null values obtained from outer join: 0
time: 2.3 s (started: 2024-09-07 17:51:31 +00:00)

Since we obtain the same shape and 0 null values from every join, all four joins are equivalent here and the choice won't affect the results.

We will use the inner join for our further analysis.
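The reason all four joins agree is that Store is a complete many-to-one key: every daily row matches exactly one store row. A small sketch with made-up frames (pandas' validate argument makes this assumption explicit and raises if it is violated):

```python
import pandas as pd

# Hypothetical daily sales and store tables sharing a complete Store key
daily = pd.DataFrame({"Store": [1, 1, 2], "Sales": [5263, 6064, 8314]})
stores = pd.DataFrame({"Store": [1, 2], "StoreType": ["c", "a"]})

# validate="many_to_one" raises MergeError if Store is not unique on the right
merged = daily.merge(stores, on="Store", how="inner", validate="many_to_one")
print(merged.shape)  # (3, 3)

# With full key overlap, the left join gives the identical result
assert merged.equals(daily.merge(stores, on="Store", how="left", validate="many_to_one"))
```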

In [29]:
#using inner join for our further analysis
df= pd.merge(df_rossmann, df_store, on="Store", how= "inner")
print(df.shape)
(1017209, 18)
time: 203 ms (started: 2024-09-07 17:51:33 +00:00)
In [30]:
#checking info of our final merged dataset
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1017209 entries, 0 to 1017208
Data columns (total 18 columns):
 #   Column                     Non-Null Count    Dtype  
---  ------                     --------------    -----  
 0   Store                      1017209 non-null  int64  
 1   DayOfWeek                  1017209 non-null  int64  
 2   Date                       1017209 non-null  object 
 3   Sales                      1017209 non-null  int64  
 4   Customers                  1017209 non-null  int64  
 5   Open                       1017209 non-null  int64  
 6   Promo                      1017209 non-null  int64  
 7   StateHoliday               1017209 non-null  object 
 8   SchoolHoliday              1017209 non-null  int64  
 9   StoreType                  1017209 non-null  object 
 10  Assortment                 1017209 non-null  object 
 11  CompetitionDistance        1017209 non-null  float64
 12  CompetitionOpenSinceMonth  1017209 non-null  float64
 13  CompetitionOpenSinceYear   1017209 non-null  float64
 14  Promo2                     1017209 non-null  int64  
 15  Promo2SinceWeek            1017209 non-null  float64
 16  Promo2SinceYear            1017209 non-null  float64
 17  PromoInterval              1017209 non-null  object 
dtypes: float64(5), int64(8), object(5)
memory usage: 139.7+ MB
time: 266 ms (started: 2024-09-07 17:51:33 +00:00)

3. Typecasting¶

In [31]:
# Checking dtypes of all the variables of the dataframe
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1017209 entries, 0 to 1017208
Data columns (total 18 columns):
 #   Column                     Non-Null Count    Dtype  
---  ------                     --------------    -----  
 0   Store                      1017209 non-null  int64  
 1   DayOfWeek                  1017209 non-null  int64  
 2   Date                       1017209 non-null  object 
 3   Sales                      1017209 non-null  int64  
 4   Customers                  1017209 non-null  int64  
 5   Open                       1017209 non-null  int64  
 6   Promo                      1017209 non-null  int64  
 7   StateHoliday               1017209 non-null  object 
 8   SchoolHoliday              1017209 non-null  int64  
 9   StoreType                  1017209 non-null  object 
 10  Assortment                 1017209 non-null  object 
 11  CompetitionDistance        1017209 non-null  float64
 12  CompetitionOpenSinceMonth  1017209 non-null  float64
 13  CompetitionOpenSinceYear   1017209 non-null  float64
 14  Promo2                     1017209 non-null  int64  
 15  Promo2SinceWeek            1017209 non-null  float64
 16  Promo2SinceYear            1017209 non-null  float64
 17  PromoInterval              1017209 non-null  object 
dtypes: float64(5), int64(8), object(5)
memory usage: 139.7+ MB
time: 343 ms (started: 2024-09-07 17:51:34 +00:00)
In [32]:
# changing dtype into required format from both the datasets
df["Date"]= pd.to_datetime(df["Date"],format="%Y-%m-%d")
df["CompetitionDistance"]= df["CompetitionDistance"].astype(int)
df["CompetitionOpenSinceMonth"]= df["CompetitionOpenSinceMonth"].astype(int)
df["CompetitionOpenSinceYear"]= df["CompetitionOpenSinceYear"].astype(int)
df["Promo2SinceWeek"]= df["Promo2SinceWeek"].astype(int)
df["Promo2SinceYear"]= df["Promo2SinceYear"].astype(int)
time: 781 ms (started: 2024-09-07 17:51:34 +00:00)
In [33]:
# Verifying the dtypes
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1017209 entries, 0 to 1017208
Data columns (total 18 columns):
 #   Column                     Non-Null Count    Dtype         
---  ------                     --------------    -----         
 0   Store                      1017209 non-null  int64         
 1   DayOfWeek                  1017209 non-null  int64         
 2   Date                       1017209 non-null  datetime64[ns]
 3   Sales                      1017209 non-null  int64         
 4   Customers                  1017209 non-null  int64         
 5   Open                       1017209 non-null  int64         
 6   Promo                      1017209 non-null  int64         
 7   StateHoliday               1017209 non-null  object        
 8   SchoolHoliday              1017209 non-null  int64         
 9   StoreType                  1017209 non-null  object        
 10  Assortment                 1017209 non-null  object        
 11  CompetitionDistance        1017209 non-null  int64         
 12  CompetitionOpenSinceMonth  1017209 non-null  int64         
 13  CompetitionOpenSinceYear   1017209 non-null  int64         
 14  Promo2                     1017209 non-null  int64         
 15  Promo2SinceWeek            1017209 non-null  int64         
 16  Promo2SinceYear            1017209 non-null  int64         
 17  PromoInterval              1017209 non-null  object        
dtypes: datetime64[ns](1), int64(13), object(4)
memory usage: 139.7+ MB
time: 156 ms (started: 2024-09-07 17:51:35 +00:00)

What all manipulations have you done and insights you found?¶

We divided data wrangling into three sections:

  1. Null values treatment: We treated all the null values in both datasets with the following replacements:

a. Replaced null values of CompetitionDistance with the MEDIAN.

b. Replaced null values of CompetitionOpenSinceMonth with the MODE.

c. Replaced null values of CompetitionOpenSinceYear with the MODE.

d. Replaced null values of Promo2SinceWeek with 0.

e. Replaced null values of Promo2SinceYear with 0.

f. Replaced null values of PromoInterval with 0.

  2. Merging datasets: We did not want to compromise the quality or quantity of our dataset, since that would hurt ML model accuracy. Comparing all four join types, we found that each produces the same shape with 0 null values, because every Store key is present in both tables. We therefore went with the inner join.

  3. Typecasting: We typecast the following features so they can be visualized properly and fed as input to an ML model:

a. Typecast the Date feature to Datetime format.

b. Typecast the CompetitionDistance, CompetitionOpenSinceMonth, CompetitionOpenSinceYear, Promo2SinceWeek and Promo2SinceYear features to Integer.
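Casting Date to datetime is what enables calendar features later via the .dt accessor. A small sketch on hypothetical dates:

```python
import pandas as pd

# Hypothetical dates in the same format as the Date column
dates = pd.to_datetime(pd.Series(["2013-01-01", "2015-07-31"]), format="%Y-%m-%d")

print(dates.dt.year.tolist())       # [2013, 2015]
print(dates.dt.month.tolist())      # [1, 7]
print(dates.dt.dayofweek.tolist())  # Monday=0 ... Sunday=6
```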

4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables¶

Some stores in the dataset were temporarily closed for refurbishment, and closed days provide no sales information, so we will not consider them. We will remove the rows where stores were closed.

In [34]:
# Considering only rows where stores are open and sales are not zero.
# .copy() avoids a SettingWithCopyWarning on the drop below.
df = df[(df['Open']==1) & (df['Sales']!=0)].copy()

# Since every remaining row has an open store, the 'Open' column carries no information and can be dropped
df.drop(['Open'], axis=1, inplace=True)
time: 187 ms (started: 2024-09-07 17:51:35 +00:00)
In [35]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 844338 entries, 0 to 1017190
Data columns (total 17 columns):
 #   Column                     Non-Null Count   Dtype         
---  ------                     --------------   -----         
 0   Store                      844338 non-null  int64         
 1   DayOfWeek                  844338 non-null  int64         
 2   Date                       844338 non-null  datetime64[ns]
 3   Sales                      844338 non-null  int64         
 4   Customers                  844338 non-null  int64         
 5   Promo                      844338 non-null  int64         
 6   StateHoliday               844338 non-null  object        
 7   SchoolHoliday              844338 non-null  int64         
 8   StoreType                  844338 non-null  object        
 9   Assortment                 844338 non-null  object        
 10  CompetitionDistance        844338 non-null  int64         
 11  CompetitionOpenSinceMonth  844338 non-null  int64         
 12  CompetitionOpenSinceYear   844338 non-null  int64         
 13  Promo2                     844338 non-null  int64         
 14  Promo2SinceWeek            844338 non-null  int64         
 15  Promo2SinceYear            844338 non-null  int64         
 16  PromoInterval              844338 non-null  object        
dtypes: datetime64[ns](1), int64(12), object(4)
memory usage: 116.0+ MB
time: 219 ms (started: 2024-09-07 17:51:35 +00:00)

Now we will do data visualization in a structured way following the 'UBM' rule:

  • Univariate Analysis

  • Bivariate Analysis

  • Multivariate Analysis

Univariate Analysis¶

Chart-1: Checking frequency distribution of continuous features - Sales, CompetitionDistance, Customers:¶

In [36]:
# Checking frequency distribution for the continuous features: Sales, CompetitionDistance, Customers
plt.figure(figsize=(20,10))

#First plot(Sales vs Frequency)
plt.subplot(2,2,1)
plt.xlabel("Sales")
plt.ylabel("Frequency")
sns.kdeplot(df["Sales"], color="Green", fill=True)  # kernel density estimate (KDE) plot
plt.title('Density distribution of Sales',size = 20)

#Second plot(CompetitionDistance vs Frequency)
plt.subplot(2,2,2)
plt.xlabel("CompetitionDistance")
plt.ylabel("Frequency")
sns.kdeplot(df["CompetitionDistance"], color="Blue", fill=True)  # kernel density estimate (KDE) plot
plt.title('Density distribution of CompetitionDistance',size = 20)

#Third plot(Customers vs Frequency)
plt.subplot(2,2,3)
plt.xlabel("Customers")
plt.ylabel("Frequency")
sns.kdeplot(df["Customers"], color="Red", fill=True)  # kernel density estimate (KDE) plot
plt.title('Density distribution of Customers',size = 20)
Out[36]:
Text(0.5, 1.0, 'Density distribution of Customers')
[KDE plots of Sales, CompetitionDistance and Customers]
time: 11.9 s (started: 2024-09-07 17:51:35 +00:00)
1. Why did you pick the specific chart?¶

We picked this chart because it shows whether the observations are high or low, and whether they are concentrated in one area or spread across the entire scale, for the continuous features.

2. What is/are the insight(s) found from the chart?¶
  1. The average Sales value is under 5000 and the distribution is right-skewed, which suggests most stores serve areas where the middle class resides.

  2. We can clearly observe that most stores have their nearest competitor within a 5 km range, indicating that competing stores are mostly located close to each other.

  3. The average number of customers visiting a store is around 700.

3. Will the gained insights help creating a positive business impact?¶

Are there any insights that lead to negative growth? Justify with specific reason.

  1. As we can see, most sales values are on the lower end; to expand business and sales, the chain could consider opening stores in affluent areas where the upper class lives.

  2. The frequency distributions of sales, customers and competition distance show that many competing stores are densely located, so the business should plan strategies to sustain itself in this competitive market.

Chart-2: Checking frequency distribution of categorical features - SchoolHoliday, Promo, Promo2:¶

In [37]:
# Setting up the subplots with 1 row and 3 columns
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# Plot 1: Percentage of School Holiday
labels = 'Not SchoolHoliday', 'SchoolHoliday'
sizes = df.SchoolHoliday.value_counts()
colors = ['pink', 'yellow']
explode = (0.2, 0.0)

axes[0].pie(sizes, explode=explode, labels=labels, colors=colors,
            autopct='%1.1f%%', shadow=True)
axes[0].axis('equal')
axes[0].set_title("Percentage of School Holiday", fontsize=16)
axes[0].legend(labels, loc="best")

# Plot 2: Percentage of Promo
labels = 'Promo', 'Not Promo'
sizes = df.Promo.value_counts()
colors = ['lightblue', 'lightgreen']
explode = (0.2, 0.0)

axes[1].pie(sizes, explode=explode, labels=labels, colors=colors,
            autopct='%1.1f%%', shadow=True)
axes[1].axis('equal')
axes[1].set_title("Percentage of Promoted Stores", fontsize=16)
axes[1].legend(labels, loc="best")

# Plot 3: Percentage of Promo2
labels = 'Promo2', 'Not Promo2'
sizes = df.Promo2.value_counts()
colors = ['violet', 'lightcoral']
explode = (0.2, 0.0)

axes[2].pie(sizes, explode=explode, labels=labels, colors=colors,
            autopct='%1.1f%%', shadow=True)
axes[2].axis('equal')
axes[2].set_title("Percentage of Continuous Promotion Stores", fontsize=16)
axes[2].legend(labels, loc="best")

# Adjust layout for better spacing
plt.tight_layout()

# Display the subplots
plt.show()
[Pie charts of SchoolHoliday, Promo and Promo2 shares]
time: 672 ms (started: 2024-09-07 17:51:47 +00:00)
1. Why did you pick the specific chart?¶

We chose the pie chart because it represents the contribution of each part to the whole, with the arc size of each slice directly proportional to that part's contribution.

2. What is/are the insight(s) found from the chart?¶

1. From the first pie chart we can see that roughly 20% of the observations fall on school holidays.

2. We see that 44.6% of stores are running promotions and 55.4% are not. The stores that do not run promotions may lack the budget or may not see enough need for them.

3. From the third chart we can infer that the share of stores enrolled in the continuous promotion (Promo2) is almost equal to the share that is not.

3. Will the gained insights help creating a positive business impact?¶

Are there any insights that lead to negative growth? Justify with specific reason.

  1. From the above pie charts we saw that slightly fewer store-days run promotions than not, yet sales are positively correlated with Promo: when stores run a promotion, sales increase. The business should therefore promote stores more to increase revenue.
  2. Later we will check whether school holidays have any impact on sales.
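The promo–sales link noted above can be sanity-checked numerically. A minimal sketch, using a tiny hypothetical frame in place of the notebook's df:

```python
import pandas as pd

# Hypothetical mini-dataset standing in for the notebook's df
toy = pd.DataFrame({
    "Promo": [1, 1, 1, 0, 0, 0],
    "Sales": [8000, 9000, 8500, 5000, 6000, 5500],
})

# Mean sales with and without a running promotion
promo_means = toy.groupby("Promo")["Sales"].mean()
uplift = promo_means.loc[1] - promo_means.loc[0]
print(promo_means)
print(f"Average uplift under promo: {uplift:.0f}")
```

On the real df the same groupby gives the per-group means that the pie charts and bar plots summarise visually.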

Chart-3: Checking frequency distribution of StoreType, Assortment, StateHoliday and PromoInterval:¶

In [38]:
# Store type and assortment analysis
plt.figure(figsize=(13, 12))

# Plot 1: Store Type
plt.subplot(2, 2, 1)
ax1 = sns.countplot(x=df['StoreType'], palette='Set2', edgecolor='black')
for p in ax1.patches:
    height = p.get_height()
    ax1.annotate('{:.2f}%'.format(100 * height / len(df['StoreType'])), 
                 (p.get_x() + p.get_width() / 2, height + 0.5),
                 ha="center", fontsize=15)
ax1.set_title('Store Type Distribution', fontsize=16)

# Plot 2: Assortment
plt.subplot(2, 2, 2)
ax2 = sns.countplot(x=df['Assortment'], palette='Set2', edgecolor='black')
for p in ax2.patches:
    height = p.get_height()
    ax2.annotate('{:.2f}%'.format(100 * height / len(df['Assortment'])), 
                 (p.get_x() + p.get_width() / 2, height + 0.5),
                 ha="center", fontsize=15)
ax2.set_title('Assortment Type Distribution', fontsize=16)

# Plot 3: State Holiday
plt.subplot(2, 2, 3)
ax3 = sns.countplot(x=df['StateHoliday'], palette='Set2', edgecolor='black')
for p in ax3.patches:
    height = p.get_height()
    ax3.annotate('{:.2f}%'.format(100 * height / len(df['StateHoliday'])), 
                 (p.get_x() + p.get_width() / 2, height + 0.5),
                 ha="center", fontsize=15)
ax3.set_title('State Holiday Distribution', fontsize=16)

# Plot 4: Promo Interval
plt.subplot(2, 2, 4)
ax4 = sns.countplot(x=df['PromoInterval'], palette='Set2', edgecolor='black')
for p in ax4.patches:
    height = p.get_height()
    ax4.annotate('{:.2f}%'.format(100 * height / len(df['PromoInterval'])), 
                 (p.get_x() + p.get_width() / 2, height + 0.5),
                 ha="center", fontsize=16)
ax4.set_title('Promo Interval Distribution', fontsize=16)

# Adjust layout and display
plt.tight_layout()
plt.show()
No description has been provided for this image
time: 7.94 s (started: 2024-09-07 17:51:48 +00:00)
1. Why did you pick the specific chart?¶

We plotted the above graphs to know the distribution of the different store types, assortments, state holidays and promo intervals in the dataset.

2. What is/are the insight(s) found from the chart?¶
  1. Store type "a" is the most common and store type "b" is the rarest in the dataset.

  2. Most stores have assortment "a", while assortment "b" is very rare. Since there is a significant difference in the frequency of the different store types and assortments, certain store types and assortments may be more successful or profitable than others, or better suited to certain kinds of stores or locations.

  3. We can infer that many stores run promotions in the first quarter of the promo interval, while the number of stores promoting in successive quarters decreases gradually.

3. Will the gained insights help creating a positive business impact?¶

Are there any insights that lead to negative growth? Justify with specific reason.

As we can see, store type "a" and assortment "a" are the most frequent while store type "b" and assortment "b" are the least frequent. It will be interesting to see whether these store types and assortments also achieve the highest sales in the bivariate analysis.

Chart-4: Checking frequency distribution of Promo2:¶

In [39]:
# Promo2 since weeks
plt.figure(figsize=(20, 8))

ax = sns.countplot(x=df['Promo2SinceWeek'], palette='Set2', edgecolor='black')
for p in ax.patches:
    ax.annotate('{:.1f}'.format(p.get_height()),
                (p.get_x() + 0.25, p.get_height() + 0.01))

plt.show()
time: 0 ns (started: 2024-09-07 17:51:56 +00:00)
1. Why did you pick the specific chart?¶

To see for how many weeks the stores have been participating in Promo2.

2. What is/are the insight(s) found from the chart?¶

In this countplot, 0 has the greatest count; 0 indicates stores that are not participating in Promo2, while the remaining bars count stores by the number of weeks they have been participating in Promo2. One possibility is that stores drop out because customers are less responsive to a second promotion, having already taken advantage of a similar one earlier. Another is that the store may not have invested enough in advertising the promotion to customers, resulting in lower awareness and fewer sales.

3. Will the gained insights help creating a positive business impact?¶

Are there any insights that lead to negative growth? Justify with specific reason.

From the above graph we can infer that many stores do not continue participating in Promo2, which suggests the promotions are not meeting their targets.
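The share of stores that never joined Promo2 (week 0) can also be read off numerically; a minimal sketch with hypothetical values standing in for df['Promo2SinceWeek']:

```python
import pandas as pd

# Hypothetical Promo2SinceWeek values; 0 marks stores not participating in Promo2
weeks = pd.Series([0, 0, 0, 13, 14, 0, 22, 0])

# Fraction of rows with week 0, i.e. not in Promo2
share_not_participating = (weeks == 0).mean()
print(f"Share of rows not in Promo2: {share_not_participating:.0%}")
```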

Bivariate Analysis¶

Chart-1: Sales vs DayOfWeek¶

In [47]:
# Sales Vs DayOfWeek
plt.figure(figsize=(13,7))
plots=sns.barplot(x=df['DayOfWeek'],y=df['Sales'],edgecolor='black')
for bar in plots.patches:
    plots.annotate(format(bar.get_height(), '.2f'),
                   (bar.get_x() + bar.get_width() / 2,
                    bar.get_height()), ha='center', va='center',
                   size=15, xytext=(0, 8),
                   textcoords='offset points' , annotation_clip=True)
plt.show()
No description has been provided for this image
time: 13.3 s (started: 2024-09-07 17:55:08 +00:00)
1. Why did you pick the specific chart?¶

We used bar plots because they are a useful tool for visualizing and understanding categorical data, and can be an effective way to communicate information to a wide audience.

2. What is/are the insight(s) found from the chart?¶

Day 1 and day 7 see the highest sales, suggesting they fall on the weekend. Days 2 to 6 generate medium to low sales, suggesting they are weekdays with lower customer footfall.

3. Will the gained insights help creating a positive business impact?¶

Are there any insights that lead to negative growth? Justify with specific reason.

  1. Days 1 and 7 see good sales, so more staff and delivery personnel can be hired specifically for these days to capture more revenue.
  2. Since only two days see good sales, the store can target the remaining five days of the week for more revenue, for example by running exciting offers to attract more customers on weekdays.
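The weekday pattern described above boils down to a groupby; a minimal sketch on hypothetical (DayOfWeek, Sales) rows in place of the notebook's df:

```python
import pandas as pd

# Hypothetical rows standing in for the notebook's df
toy = pd.DataFrame({
    "DayOfWeek": [1, 1, 2, 3, 7, 7],
    "Sales":     [9000, 8800, 6000, 5800, 9500, 9300],
})

# Average sales per weekday, strongest days first
day_means = toy.groupby("DayOfWeek")["Sales"].mean().sort_values(ascending=False)
print(day_means)
```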

Chart-2: Sales vs Year¶

In [48]:
# Sales vs Year
plt.figure(figsize=(13,7))
plots=sns.barplot(x=df["Date"].dt.year,y=df['Sales'],edgecolor='black')
for bar in plots.patches:
    plots.annotate(format(bar.get_height(), '.2f'),
                   (bar.get_x() + bar.get_width() / 2,
                    bar.get_height()), ha='center', va='center',
                   size=15, xytext=(0, 8),
                   textcoords='offset points')

plt.show()
No description has been provided for this image
time: 16.9 s (started: 2024-09-07 17:55:42 +00:00)
1. Why did you pick the specific chart?¶

We used this plot to visualize the distribution of sales across 2013, 2014 and 2015. Bar plots help us understand how average sales vary from year to year.

2. What is/are the insight(s) found from the chart?¶

For any business, year-on-year sales growth is a good parameter to assess store growth. Here we can observe that even though sales increase year on year, the growth is not even 10%, so the owners need to put in extra effort to increase yearly sales.

3. Will the gained insights help creating a positive business impact?¶

Are there any insights that lead to negative growth? Justify with specific reason.

Yes. The insights from this graph are very important for the owners: since growth is not at a commendable pace, they need to target more customers and make changes to operations.
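The "growth below 10%" observation can be quantified with pct_change on yearly totals; a minimal sketch with hypothetical yearly sums standing in for df.groupby(df['Date'].dt.year)['Sales'].sum():

```python
import pandas as pd

# Hypothetical yearly sales totals standing in for the real aggregation
yearly = pd.Series({2013: 100_000, 2014: 106_000, 2015: 111_300})

# Year-on-year growth in percent (first year has no predecessor, hence NaN)
growth = yearly.pct_change() * 100
print(growth.round(1))
```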

Chart-3: Sales vs Month¶

In [74]:
# Month vs sales
plt.figure(figsize=(18,7))
plots=sns.barplot(x=df['Date'].dt.month,y=df['Sales'],edgecolor='black')
for bar in plots.patches:
    plots.annotate(format(bar.get_height(), '.2f'),
                   (bar.get_x() + bar.get_width() / 2,
                    bar.get_height()), ha='center', va='center',
                   size=15, xytext=(0, 8),
                   textcoords='offset points')
plt.xlabel("Month")


plt.figure(figsize=(15,6))
sns.pointplot(x=df['Date'].dt.month, y=df['Sales'], data=df, hue="StoreType")
plt.title('Monthly Sales by Store Type')

plt.show()
No description has been provided for this image
No description has been provided for this image
time: 18 s (started: 2024-09-07 18:34:52 +00:00)
1. Why did you pick the specific chart?¶

To know the patterns or trends in the data, such as a peak in activity or sales during a particular month of the year.

2. What is/are the insight(s) found from the chart?¶

The bar plot highlights that December, being a festive month, attracts more sales than the rest of the months. November also has slightly higher sales, possibly due to the globally popular 'Black Friday' sale. As Rossmann stores deal in health and beauty products, the November and December spikes can plausibly be attributed to celebratory buying of beauty/health products.

3. Will the gained insights help creating a positive business impact?¶

Are there any insights that lead to negative growth? Justify with specific reason.

Yes, these insights definitely create a positive business impact: owners can stock more goods to meet demand and increase revenue by keeping stores open on weekends and holidays, since customers aim to buy more in this period.

Chart-4: Sales vs Customer¶

In [56]:
# Sale Vs Customer
plt.figure(figsize=(13,7))
sns.scatterplot(data=df,x=df['Customers'],y=df['Sales'],hue=df["StoreType"])
# df.columns
Out[56]:
<Axes: xlabel='Customers', ylabel='Sales'>
No description has been provided for this image
time: 32.1 s (started: 2024-09-07 18:04:27 +00:00)
1. Why did you pick the specific chart?¶

We chose a scatter plot because we wanted to show the relationship between the number of customers visiting the store and total sales. Scatter plots are useful for identifying outliers, determining the correlation between two variables, and verifying a linear trend in the dataset.

2. What is/are the insight(s) found from the chart?¶

The above scatterplot shows a positive correlation between 'Sales' and 'Customers': as the number of customers increases, sales also tend to increase. This confirms a roughly linear relationship between the two variables.

3. Will the gained insights help creating a positive business impact?¶

Are there any insights that lead to negative growth? Justify with specific reason.

Customers and sales are directly proportional to each other and are highly correlated. It can help businesses to promote their strategies using marketing campaigns, advertisements to attract more customers ultimately shooting up the sales.

Chart-5: Sales vs Promo¶

In [55]:
# Sales vs Promo
plt.figure(figsize=(7,7))
plots=sns.barplot(x=df['Promo'],y=df['Sales'],edgecolor='black')
for bar in plots.patches:
    plots.annotate(format(bar.get_height(), '.2f'),
                   (bar.get_x() + bar.get_width() / 2,
                    bar.get_height()), ha='center', va='center',
                   size=15, xytext=(0, 8),
                   textcoords='offset points')
plt.show()
No description has been provided for this image
time: 14.5 s (started: 2024-09-07 18:03:57 +00:00)
1. Why did you pick the specific chart?¶

To compare sales made at stores that were running promotions against those that were not. We used bar plots because they are a useful tool for visualizing and understanding categorical data, and can be an effective way to communicate information to a wide audience.

2. What is/are the insight(s) found from the chart?¶

From the above plot we can see the effectiveness of promotions in increasing sales: stores running promotions see substantially higher sales, showing a positive correlation between Promo and Sales.

3. Will the gained insights help creating a positive business impact?¶

Are there any insights that lead to negative growth? Justify with specific reason.

Business owners should promote their stores to sustain themselves in the market, which should eventually increase sales.

Chart-6: Sales vs StateHoliday¶

In [ ]:
# Sales vs StateHoliday
plt.figure(figsize=(7,7))
plots=sns.barplot(x=df['StateHoliday'],y=df['Sales'],edgecolor='black')
for bar in plots.patches:
    plots.annotate(format(bar.get_height(), '.2f'),
                   (bar.get_x() + bar.get_width() / 2,
                    bar.get_height()), ha='center', va='center',
                   size=15, xytext=(0, 8),
                   textcoords='offset points')
plt.show()
1. Why did you pick the specific chart?¶

To know the sales for the various state holidays.

2. What is/are the insight(s) found from the chart?¶

From the above plot, we observe that sales are highest for state holiday 'b', followed by state holiday 'c'. One interesting insight is that sales are comparatively low on normal days (i.e. with no state holiday). Factors contributing to higher sales on these holidays could include increased consumer spending and special promotions or events.

3. Will the gained insights help creating a positive business impact?¶

Are there any insights that lead to negative growth? Justify with specific reason.

The business can create special offers or bundle products to increase the sales during the religious festivals.

Chart-7: Sales vs SchoolHoliday¶

In [57]:
# Sales vs SchoolHoliday

plt.figure(figsize=(7,7))

plots=sns.barplot(x=df['SchoolHoliday'],y=df['Sales'],edgecolor='black')
for bar in plots.patches:
    plots.annotate(format(bar.get_height(), '.2f'),
                   (bar.get_x() + bar.get_width() / 2,
                    bar.get_height()), ha='center', va='center',
                   size=15, xytext=(0, 8),
                   textcoords='offset points')    
plt.show()
No description has been provided for this image
time: 20 s (started: 2024-09-07 18:05:31 +00:00)
1. Why did you pick the specific chart?¶

To compare sales made at stores on school holidays and on non-school holidays.

2. What is/are the insight(s) found from the chart?¶

We can see there is not much difference in sales; however, sales are slightly higher on school holidays. School holidays may be associated with families going on vacation or parents taking time off work to spend with their children, which could increase consumer spending.

3. Will the gained insights help creating a positive business impact?¶

Are there any insights that lead to negative growth? Justify with specific reason.

As we can see from the graph, it is not making much difference whether there is school holiday or not. Still, businesses can target school holidays and run more promotional offers.

Chart-8: Sales vs StoreType¶

In [58]:
# Sales vs StoreType
plt.figure(figsize=(7,7))
plots=sns.barplot(x=df['StoreType'],y=df['Sales'],edgecolor='black')
for bar in plots.patches:
    plots.annotate(format(bar.get_height(), '.2f'),
                   (bar.get_x() + bar.get_width() / 2,
                    bar.get_height()), ha='center', va='center',
                   size=15, xytext=(0, 8),
                   textcoords='offset points')
plt.show()
No description has been provided for this image
time: 17.2 s (started: 2024-09-07 18:05:51 +00:00)
1. Why did you pick the specific chart?¶

To know which kind of store is able to generate maximum profit.

2. What is/are the insight(s) found from the chart?¶
  1. Sales for store type "b" are the highest. Store type "b" might be located in more affluent or high-traffic areas, increasing the number of potential customers, or may have a more favorable layout that makes it more attractive and makes it easier for customers to find products, resulting in more sales.

  2. Earlier, the univariate graph of store type showed that types "a", "c" and "d" are more numerous, but this is not reflected in sales: even though type "b" stores are fewer, they generate higher sales.

  3. It might be the case that store type "b" stocks costly, luxurious items.

3. Will the gained insights help creating a positive business impact?¶

Are there any insights that lead to negative growth? Justify with specific reason.

  1. Since store types 'a', 'c' and 'd' generate similar sales, lower than store type 'b', they can follow the business strategies and marketing tactics of store type 'b'.

  2. The company could open more type 'b' stores to increase profit.

Chart-9: Sales vs Assortment¶

In [59]:
# Sales vs Assortment
plt.figure(figsize=(7,7))
plots=sns.barplot(x=df['Assortment'],y=df['Sales'],edgecolor='black')
for bar in plots.patches:
    plots.annotate(format(bar.get_height(), '.2f'),
                   (bar.get_x() + bar.get_width() / 2,
                    bar.get_height()), ha='center', va='center',
                   size=15, xytext=(0, 8),
                   textcoords='offset points')
plt.show()
No description has been provided for this image
time: 17.2 s (started: 2024-09-07 18:06:08 +00:00)
1. Why did you pick the specific chart?¶

To know the sales for the various assortments, i.e. which assortment generates more profit.

2. What is/are the insight(s) found from the chart?¶
  1. Earlier we saw that assortment "b" stores are far fewer in number than "a" and "c", but the bivariate graph shows that assortment "b" captures the highest average sales.

  2. Sales are highest for assortment "b". This assortment may carry a good mix of products that are in high demand or unique to the store, resulting in more sales.

3. Will the gained insights help creating a positive business impact?¶

Are there any insights that lead to negative growth? Justify with specific reason.

Since sales are highest for assortment 'b', Rossmann owners can focus more on this specific combination and reduce dependency on assortments 'a' and 'c'. This could be a cost-effective move.

Chart-10: Sales vs CompetitionDistance¶

In [65]:
# Sales Vs CompetitionDistance
plt.figure(figsize=(13,7))
sns.scatterplot(data=df, x=df['CompetitionDistance'],y=df['Sales'],hue="StoreType")
Out[65]:
<Axes: xlabel='CompetitionDistance', ylabel='Sales'>
No description has been provided for this image
time: 38.2 s (started: 2024-09-07 18:08:17 +00:00)
1. Why did you pick the specific chart?¶

To know the relationship between the sales and competition distance.

2. What is/are the insight(s) found from the chart?¶

From the above scatter plot it can be observed that most competitor stores are not far from each other, and stores densely located near each other see more sales.

3. Will the gained insights help creating a positive business impact?¶

Are there any insights that lead to negative growth? Justify with specific reason.

As we can see, sales are higher for densely located stores, so stores should continue providing attractive offers and services to compete in the market.
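One way to make the distance–sales relationship in the scatter plot concrete is to quartile-bin CompetitionDistance and compare mean sales per bin; a minimal sketch on hypothetical values standing in for the notebook's df:

```python
import pandas as pd

# Hypothetical distances and sales standing in for the notebook's df
toy = pd.DataFrame({
    "CompetitionDistance": [100, 300, 800, 2000, 5000, 12000, 20000, 40000],
    "Sales":               [9000, 8800, 8000, 7000, 6500, 6000, 5800, 5500],
})

# Cut the distance into quartiles and compare average sales in each bin
toy["DistBin"] = pd.qcut(toy["CompetitionDistance"], q=4,
                         labels=["nearest", "near", "far", "farthest"])
bin_means = toy.groupby("DistBin", observed=True)["Sales"].mean()
print(bin_means)
```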

Chart-11: Sales vs CompetitionOpenSinceYear¶

In [71]:
# Sale Vs CompetitionOpenSinceYear
plt.figure(figsize=(15,6))
sns.pointplot(x= 'CompetitionOpenSinceYear', y= 'Sales', data=df,hue="StoreType")
plt.title('Plot between Sales and Competition Open Since year')
Out[71]:
Text(0.5, 1.0, 'Plot between Sales and Competition Open Since year')
No description has been provided for this image
time: 1min 40s (started: 2024-09-07 18:21:22 +00:00)
1. Why did you pick the specific chart?¶

To know the average sales for each year since a competitor opened near the store .

2. What is/are the insight(s) found from the chart?¶

From the plot we can see that sales are high for competitors that opened around 1900: at that time very few competing stores operated near Rossmann stores, so there was less competition and sales were high. As the years passed, the number of competitor stores increased, competition grew, and sales declined.

3. Will the gained insights help creating a positive business impact?¶

Are there any insights that lead to negative growth? Justify with specific reason.

Since almost every subsequent year reports sudden drop/rise in the sales, owners need to work on their stockings and marketing tactics. The basic reason for this sudden change could be explored more deeply and a plausible solution to it can be reached.

Chart-12: Sales vs Promo2¶

In [67]:
# Sales vs Promo2

plt.figure(figsize=(7,7))

plots=sns.barplot(x=df['Promo2'],y=df['Sales'],edgecolor='black')
for bar in plots.patches:
    plots.annotate(format(bar.get_height(), '.2f'),
                   (bar.get_x() + bar.get_width() / 2,
                    bar.get_height()), ha='center', va='center',
                   size=15, xytext=(0, 8),
                   textcoords='offset points')    
plt.show()
No description has been provided for this image
time: 15.5 s (started: 2024-09-07 18:09:08 +00:00)
1. Why did you pick the specific chart?¶

To compare sales in the presence and absence of Promo2.

2. What is/are the insight(s) found from the chart?¶

The barplot shows that customers are slightly less responsive to stores running consecutive promotions (Promo2). One possibility is that customers have already taken advantage of a similar promotion earlier. Another is that the store may not have invested enough in advertising the promotion, resulting in lower awareness. Repeating the same promotion again and again could also lower customer footfall, ultimately leading to fewer sales.

3. Will the gained insights help creating a positive business impact?¶

Are there any insights that lead to negative growth? Justify with specific reason.

Yes. The insights are indeed helpful as they highlight one of the most important things about consecutive promotions: clearly, such promotions are not enough to convert customers into buyers. The store needs to come up with more innovative and more rewarding solutions.

Chart-13: Sales vs Promo2SinceYear¶

In [68]:
# Sales vs Promo2SinceYear

plt.figure(figsize=(10,7))

plots=sns.barplot(x=df['Promo2SinceYear'],y=df['Sales'],edgecolor='black')
for bar in plots.patches:
    plots.annotate(format(bar.get_height(), '.2f'),
                   (bar.get_x() + bar.get_width() / 2,
                    bar.get_height()), ha='center', va='center',
                   size=15, xytext=(0, 8),
                   textcoords='offset points')    
plt.show()
No description has been provided for this image
time: 14.2 s (started: 2024-09-07 18:09:23 +00:00)
1. Why did you pick the specific chart?¶

To know the average sales by the year in which stores started Promo2.

2. What is/are the insight(s) found from the chart?¶

This barplot shows that sales were still highest when stores were not running any consecutive promotions. In 2014, however, sales under Promo2 shot up and were recorded as the second highest; good-quality products, better deals or the shutdown of competitors could be the reasons.

3. Will the gained insights help creating a positive business impact?¶

Are there any insights that lead to negative growth? Justify with specific reason.

Yes. These insights highlight an important point about consecutive promotions: even after several years of Promo2, sales mostly stay below the no-Promo2 level, so such promotions alone are not converting customers into buyers. The store needs to come up with more innovative and more rewarding offers.

Chart-14: Sales vs PromoInterval¶

In [69]:
# Sales vs PromoInterval

plt.figure(figsize=(7,7))

plots=sns.barplot(x=df['PromoInterval'],y=df['Sales'],edgecolor='black')
for bar in plots.patches:
    plots.annotate(format(bar.get_height(), '.2f'),
                   (bar.get_x() + bar.get_width() / 2,
                    bar.get_height()), ha='center', va='center',
                   size=15, xytext=(0, 8),
                   textcoords='offset points')    
plt.show()
No description has been provided for this image
time: 16.6 s (started: 2024-09-07 18:09:38 +00:00)
1. Why did you pick the specific chart?¶

To know the average sales for the various promo intervals.

2. What is/are the insight(s) found from the chart?¶

This barplot shows that sales were still highest when stores were not running consecutive promotions. Among Promo2 stores, the Jan, Apr, Jul, Oct interval records the second-highest sales, as it covers the festive season, though the other intervals record sales close to it.

3. Will the gained insights help creating a positive business impact?¶

Are there any insights that lead to negative growth? Justify with specific reason.

The business needs to redesign the promotion pattern, as sales decrease gradually from the first quarter of the promo interval onwards.

Multivariate Analysis¶

Chart-1 Pair Plot¶

In [70]:
# # Pair Plot visualization code
sns.pairplot(df) 
Out[70]:
<seaborn.axisgrid.PairGrid at 0x289a4bfb9d0>
No description has been provided for this image
time: 9min 35s (started: 2024-09-07 18:09:54 +00:00)
1. Why did you pick the specific chart?¶

It can give multiple visual aids in a single frame and various insights related to the data can be gained in one single look.

2. What is/are the insight(s) found from the chart?¶

The pairplot helped us visualize the relationships between Sales and other variables, such as the Customers, promotion, competition and school-holiday columns. This helped us identify which variables might be useful for predicting sales and informed the design of our machine learning model.

3. Will the gained insights help creating a positive business impact?¶

Are there any insights that lead to negative growth? Justify with specific reason.

Yes. By plotting the pairplot we learned which features impact Sales the most.

Chart-2 Correlation Heatmap¶

In [78]:
# plt.figure(figsize=(15, 10))
# sns.heatmap(df.corr(numeric_only=True), annot=True, cmap=sns.color_palette('Pastel1'), square=True)
# plt.title('Correlation Heatmap for Rossmann sales data', size=20)
df.head()
Out[78]:
Store DayOfWeek Date Sales Customers Promo StateHoliday SchoolHoliday StoreType Assortment CompetitionDistance CompetitionOpenSinceMonth CompetitionOpenSinceYear Promo2 Promo2SinceWeek Promo2SinceYear PromoInterval
0 1 5 2015-07-31 5263 555 1 0 1 c a 1270 9 2008 0 0 0 0
1 2 5 2015-07-31 6064 625 1 0 1 a a 570 11 2007 1 13 2010 Jan,Apr,Jul,Oct
2 3 5 2015-07-31 8314 821 1 0 1 a a 14130 12 2006 1 14 2011 Jan,Apr,Jul,Oct
3 4 5 2015-07-31 13995 1498 1 0 1 c c 620 9 2009 0 0 0 0
4 5 5 2015-07-31 4822 559 1 0 1 a a 29910 4 2015 0 0 0 0
time: 406 ms (started: 2024-09-07 18:37:15 +00:00)

5. Hypothesis Testing¶

Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.¶

1- Null Hypothesis - There is no relation between Customers and Sales

Alternate Hypothesis - There is a relationship between Customers and sales

2- Null Hypothesis - There is no relation between DayOfWeek and Sales

Alternate Hypothesis - There is a relation between DayOfWeek and Sales

3- Null Hypothesis - There is no relation between SchoolHoliday and Sales

Alternate Hypothesis - There is a relation between SchoolHoliday and Sales

Hypothetical Statement - 1¶

1. State Your research hypothesis as a null hypothesis and alternate hypothesis.¶

Null Hypothesis - There is no relation between Customers and Sales

Alternate Hypothesis - There is a relationship between Customers and sales

2. Perform an appropriate statistical test.¶

In [79]:
# Perform Pearson correlation test to obtain the p-value
from scipy.stats import pearsonr
first_sample = df["Customers"].head(60)
second_sample = df["Sales"].head(60)

stat, p = pearsonr(first_sample, second_sample)
print('stat=%.3f, p = %.2f' % (stat, p))
if p > 0.05:
    print('Fail to reject the null hypothesis')
else:
    print('Reject the null hypothesis')
stat=0.840, p = 0.00
Reject the null hypothesis
time: 375 ms (started: 2024-09-07 18:40:38 +00:00)
Which statistical test have you done to obtain P-Value?¶

We used the Pearson correlation test to obtain the p-value along with the Pearson correlation coefficient, a measure of the linear correlation between two sets of data.

Why did you choose the specific statistical test?¶

We want to check whether two features are positively or negatively correlated. The Pearson correlation coefficient always lies between -1 and 1 (the p-value lies between 0 and 1). Applying the test to Customers and Sales we obtained a coefficient of 0.840, which implies a strong positive correlation between these two features.

Hypothetical Statement - 2¶

1. State Your research hypothesis as a null hypothesis and alternate hypothesis.¶

Null Hypothesis - There is no relation between DayOfWeek and Sales

Alternate Hypothesis - There is a relationship between DayOfWeek and sales

2. Perform an appropriate statistical test.¶

In [80]:
# Perform Pearson correlation test to obtain the p-value
from scipy.stats import pearsonr
first_sample = df["DayOfWeek"].head(60)
second_sample = df["Sales"].head(60)

stat, p = pearsonr(first_sample, second_sample)
print('stat=%.3f, p = %.2f' % (stat, p))
if p > 0.05:
    print('Fail to reject the null hypothesis')
else:
    print('Reject the null hypothesis')
stat=nan, p = nan
Reject the null hypothesis
time: 110 ms (started: 2024-09-07 18:40:48 +00:00)
Which statistical test have you done to obtain P-Value?¶

We used the Pearson correlation test to obtain the p-value along with the Pearson correlation coefficient, a measure of the linear correlation between two sets of data.

Why did you choose the specific statistical test?¶

We want to check whether two features are positively or negatively correlated. Note, however, that the test above returned nan: the first 60 rows all fall on the same DayOfWeek, so one sample is constant and the Pearson coefficient is undefined. A nan p-value also fails the p > 0.05 check, so the 'reject' branch prints spuriously; a sample spanning several weekdays is needed before drawing any conclusion here.
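Since the statistic came out as nan on a sample where DayOfWeek barely varies, a test built for a categorical grouping variable may be more appropriate here; a minimal one-way ANOVA sketch on hypothetical weekday samples standing in for slices of df:

```python
from scipy.stats import f_oneway

# Hypothetical sales samples for three weekdays, standing in for df groups
monday    = [9000, 8800, 9200, 9100]
wednesday = [6000, 6200, 5900, 6100]
sunday    = [9400, 9600, 9500, 9300]

# One-way ANOVA: do mean sales differ across weekdays?
stat, p = f_oneway(monday, wednesday, sunday)
print('stat=%.1f, p = %.4f' % (stat, p))
```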

Hypothetical Statement - 3¶

1. State Your research hypothesis as a null hypothesis and alternate hypothesis.¶

Null Hypothesis - There is no relation between SchoolHoliday and Sales

Alternate Hypothesis - There is a relationship between SchoolHoliday and sales

2. Perform an appropriate statistical test.¶

In [82]:
# Perform Statistical Test to obtain P-Value
from scipy.stats import pearsonr
first_sample = df["SchoolHoliday"].head(60)
second_sample = df["Sales"].head(60)

stat, p = pearsonr(first_sample, second_sample)
print('stat=%.3f, p = %.2f'%(stat, p))
if p> 0.05:
  print('Accept Null Hypothesis')
  print(stat)
else:
  print('Rejected Null Hypothesis')
stat=0.058, p = 0.66
Accept Null Hypothesis
0.05831224843362071
time: 47 ms (started: 2024-09-07 18:41:40 +00:00)
Which statistical test have you done to obtain P-Value?¶

We used the Pearson correlation test to obtain the p-value along with the Pearson correlation coefficient. It is a measure of linear correlation between two sets of data.

Why did you choose the specific statistical test?¶

We want to check whether the two features are positively or negatively correlated. The Pearson correlation coefficient always lies between -1 and 1. After applying the test on the SchoolHoliday and Sales features we got a correlation coefficient of 0.058 with p = 0.66, which implies at most a very weak positive correlation, so we fail to reject the null hypothesis on this sample.


6. Feature Engineering & Data Pre-processing¶

1. Handling Missing Values¶

In [83]:
# Handling Missing Values & Missing Value Imputation
df.isna().sum()
Out[83]:
Store                        0
DayOfWeek                    0
Date                         0
Sales                        0
Customers                    0
Promo                        0
StateHoliday                 0
SchoolHoliday                0
StoreType                    0
Assortment                   0
CompetitionDistance          0
CompetitionOpenSinceMonth    0
CompetitionOpenSinceYear     0
Promo2                       0
Promo2SinceWeek              0
Promo2SinceYear              0
PromoInterval                0
dtype: int64
time: 797 ms (started: 2024-09-07 18:41:50 +00:00)

We have already treated the null values, so there are no more missing/null/duplicate values in our dataset and it is good to go.

What all missing value imputation techniques have you used and why did you use those techniques?¶

We checked for outliers by plotting box plots and then replaced the null values of the various variables with the mean, median, mode, or 0, as appropriate for each.
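As an illustration of those choices, a minimal sketch on a toy frame (the column names mirror the dataset, but the values are made up): the median for a skewed numeric column, and 0 where a missing value simply means "no Promo2":

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "CompetitionDistance": [500.0, np.nan, 1500.0, 900.0],  # skewed -> impute median
    "Promo2SinceWeek": [np.nan, 13.0, 14.0, np.nan],        # NaN means no Promo2 -> 0
})
toy["CompetitionDistance"] = toy["CompetitionDistance"].fillna(toy["CompetitionDistance"].median())
toy["Promo2SinceWeek"] = toy["Promo2SinceWeek"].fillna(0)
print(toy["CompetitionDistance"].tolist())  # [500.0, 900.0, 1500.0, 900.0]
```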

2. Handling Outliers¶

In [84]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 844338 entries, 0 to 1017190
Data columns (total 17 columns):
 #   Column                     Non-Null Count   Dtype         
---  ------                     --------------   -----         
 0   Store                      844338 non-null  int64         
 1   DayOfWeek                  844338 non-null  int64         
 2   Date                       844338 non-null  datetime64[ns]
 3   Sales                      844338 non-null  int64         
 4   Customers                  844338 non-null  int64         
 5   Promo                      844338 non-null  int64         
 6   StateHoliday               844338 non-null  object        
 7   SchoolHoliday              844338 non-null  int64         
 8   StoreType                  844338 non-null  object        
 9   Assortment                 844338 non-null  object        
 10  CompetitionDistance        844338 non-null  int64         
 11  CompetitionOpenSinceMonth  844338 non-null  int64         
 12  CompetitionOpenSinceYear   844338 non-null  int64         
 13  Promo2                     844338 non-null  int64         
 14  Promo2SinceWeek            844338 non-null  int64         
 15  Promo2SinceYear            844338 non-null  int64         
 16  PromoInterval              844338 non-null  object        
dtypes: datetime64[ns](1), int64(12), object(4)
memory usage: 116.0+ MB
time: 375 ms (started: 2024-09-07 18:41:56 +00:00)
In [85]:
# assigning continuous-value features to new variables so that the box-plot visualizations make sense
continous_value_feature= ["DayOfWeek", "Sales", "Customers", "CompetitionDistance", "CompetitionOpenSinceMonth", "CompetitionOpenSinceYear", "Promo2SinceWeek", "Promo2SinceYear"]
numeric_features= ['Store', 'DayOfWeek', 'Sales', 'Customers', 'Open', 'Promo', 'StateHoliday', 'SchoolHoliday', 'CompetitionDistance', 'CompetitionOpenSinceMonth', 'CompetitionOpenSinceYear', 'Promo2', 'Promo2SinceWeek','Promo2SinceYear']
categorical_features= ["Date", "StoreType", "Assortment", "PromoInterval"]
print("Numeric_features: ",numeric_features)
print("Categorical_features: ",categorical_features)
print("Continous_value_feature: ",continous_value_feature)
Numeric_features:  ['Store', 'DayOfWeek', 'Sales', 'Customers', 'Open', 'Promo', 'StateHoliday', 'SchoolHoliday', 'CompetitionDistance', 'CompetitionOpenSinceMonth', 'CompetitionOpenSinceYear', 'Promo2', 'Promo2SinceWeek', 'Promo2SinceYear']
Categorical_features:  ['Date', 'StoreType', 'Assortment', 'PromoInterval']
Continous_value_feature:  ['DayOfWeek', 'Sales', 'Customers', 'CompetitionDistance', 'CompetitionOpenSinceMonth', 'CompetitionOpenSinceYear', 'Promo2SinceWeek', 'Promo2SinceYear']
time: 0 ns (started: 2024-09-07 18:42:07 +00:00)
In [87]:
# checking outliers with the help of box plots for the continuous features
plt.figure(figsize=(30,30))
for n,column in enumerate(continous_value_feature):
  plt.subplot(5, 4, n+1)
  sns.boxplot(df[column])
  plt.title(f'{column.title()}',weight='bold')
  plt.tight_layout()
[Figure: box plots of the continuous features]
time: 12.4 s (started: 2024-09-07 18:42:34 +00:00)

From the above box plots we can clearly see that the features "Sales", "Customers", "CompetitionDistance", "CompetitionOpenSinceMonth" and "CompetitionOpenSinceYear" contain several outliers, while the remaining features are fine as they are effectively categorical in nature.

Let's define a function to detect the number of outliers and the percentage of outliers present in each feature so that we can handle them accordingly.

In [88]:
# defining a function for outlier detection and outlier percentage using the IQR
def detect_outliers(data):
    outliers = []
    data = sorted(data)
    q1 = np.percentile(data, 25)
    q2 = np.percentile(data, 50)
    q3 = np.percentile(data, 75)
    print(f"q1:{q1}, q2:{q2}, q3:{q3}")

    IQR = q3-q1
    lwr_bound = q1-(1.5*IQR)
    upr_bound = q3+(1.5*IQR)
    print(f"Lower bound: {lwr_bound}, Upper bound: {upr_bound}, IQR: {IQR}")

    for i in data: 
        if (i<lwr_bound or i>upr_bound):
            outliers.append(i)
    len_outliers= len(outliers)
    print(f"Total number of outliers are: {len_outliers}")

    print(f"Total percentage of outlier is: {round(len_outliers*100/len(data),2)} %")
time: 0 ns (started: 2024-09-07 18:42:59 +00:00)
In [89]:
# Determining the IQR, lower and upper bounds, and the number of outliers present in each continuous numerical feature
for feature in continous_value_feature:
  print(feature,":")
  detect_outliers(df[feature])
  print("\n")
DayOfWeek :
q1:2.0, q2:3.0, q3:5.0
Lower bound: -2.5, Upper bound: 9.5, IQR: 3.0
Total number of outliers are: 0
Total percentage of outlier is: 0.0 %


Sales :
q1:4859.0, q2:6369.0, q3:8360.0
Lower bound: -392.5, Upper bound: 13611.5, IQR: 3501.0
Total number of outliers are: 30769
Total percentage of outlier is: 3.64 %


Customers :
q1:519.0, q2:676.0, q3:893.0
Lower bound: -42.0, Upper bound: 1454.0, IQR: 374.0
Total number of outliers are: 40853
Total percentage of outlier is: 4.84 %


CompetitionDistance :
q1:710.0, q2:2325.0, q3:6880.0
Lower bound: -8545.0, Upper bound: 16135.0, IQR: 6170.0
Total number of outliers are: 83040
Total percentage of outlier is: 9.83 %


CompetitionOpenSinceMonth :
q1:6.0, q2:9.0, q3:9.0
Lower bound: 1.5, Upper bound: 13.5, IQR: 3.0
Total number of outliers are: 10297
Total percentage of outlier is: 1.22 %


CompetitionOpenSinceYear :
q1:2008.0, q2:2012.0, q3:2013.0
Lower bound: 2000.5, Upper bound: 2020.5, IQR: 5.0
Total number of outliers are: 22852
Total percentage of outlier is: 2.71 %


Promo2SinceWeek :
q1:0.0, q2:0.0, q3:22.0
Lower bound: -33.0, Upper bound: 55.0, IQR: 22.0
Total number of outliers are: 0
Total percentage of outlier is: 0.0 %


Promo2SinceYear :
q1:0.0, q2:0.0, q3:2012.0
Lower bound: -3018.0, Upper bound: 5030.0, IQR: 2012.0
Total number of outliers are: 0
Total percentage of outlier is: 0.0 %


time: 5.62 s (started: 2024-09-07 18:43:03 +00:00)
In [90]:
# Defining the function that treats outliers with the IQR technique
def treat_outliers_iqr(data):
    # Calculate the first and third quartiles
    q1, q3 = np.percentile(data, [25, 75])
    
    # Calculate the interquartile range (IQR)
    iqr = q3 - q1
    
    # Identify the outliers
    lower_bound = q1 - (1.5 * iqr)
    upper_bound = q3 + (1.5 * iqr)
    outliers = [x for x in data if x < lower_bound or x > upper_bound]
    
    # Treat the outliers: replace low outliers with Q1 and high outliers with Q3
    treated_data = [q1 if x < lower_bound else q3 if x > upper_bound else x for x in data]
    treated_data_int = [int(value) for value in treated_data]

    return treated_data_int
time: 16 ms (started: 2024-09-07 18:44:06 +00:00)
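The loop-based capping above can also be expressed without an explicit Python loop. A vectorized pandas sketch (cap_outliers_iqr is an illustrative name; it replaces low outliers with Q1 and high outliers with Q3, matching the function above):

```python
import pandas as pd

def cap_outliers_iqr(s: pd.Series) -> pd.Series:
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    lo, hi = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)
    # low outliers -> Q1, high outliers -> Q3; values inside the fences are untouched
    return s.mask(s < lo, q1).mask(s > hi, q3)

s = pd.Series([1, 2, 3, 4, 100])
print(cap_outliers_iqr(s).tolist())  # here 100 lies above the upper fence and becomes Q3 = 4
```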
In [91]:
# Passing all the features one by one from continous_value_feature into the function defined above for outlier treatment
for feature in continous_value_feature:
  df[feature]= treat_outliers_iqr(df[feature])
time: 5 s (started: 2024-09-07 18:44:09 +00:00)
In [92]:
# Replotting the box plots and rechecking the percentage of outliers still present (if any) in continous_value_feature
plt.figure(figsize=(30,20))
for n,column in enumerate(continous_value_feature):
  plt.subplot(5, 4, n+1)
  sns.boxplot(df[column])
  plt.title(f'{column.title()}',weight='bold')
  plt.tight_layout()
[Figure: box plots of the continuous features after outlier treatment]
time: 12 s (started: 2024-09-07 18:44:22 +00:00)
In [93]:
# Rechecking the total number of outliers and its percentage present in our dataset.
for feature in continous_value_feature:
  print(feature,":")
  detect_outliers(df[feature])
  print("\n")
DayOfWeek :
q1:2.0, q2:3.0, q3:5.0
Lower bound: -2.5, Upper bound: 9.5, IQR: 3.0
Total number of outliers are: 0
Total percentage of outlier is: 0.0 %


Sales :
q1:4859.0, q2:6369.0, q3:8360.0
Lower bound: -392.5, Upper bound: 13611.5, IQR: 3501.0
Total number of outliers are: 0
Total percentage of outlier is: 0.0 %


Customers :
q1:519.0, q2:676.0, q3:893.0
Lower bound: -42.0, Upper bound: 1454.0, IQR: 374.0
Total number of outliers are: 0
Total percentage of outlier is: 0.0 %


CompetitionDistance :
q1:710.0, q2:2325.0, q3:6880.0
Lower bound: -8545.0, Upper bound: 16135.0, IQR: 6170.0
Total number of outliers are: 0
Total percentage of outlier is: 0.0 %


CompetitionOpenSinceMonth :
q1:6.0, q2:9.0, q3:9.0
Lower bound: 1.5, Upper bound: 13.5, IQR: 3.0
Total number of outliers are: 0
Total percentage of outlier is: 0.0 %


CompetitionOpenSinceYear :
q1:2008.0, q2:2012.0, q3:2013.0
Lower bound: 2000.5, Upper bound: 2020.5, IQR: 5.0
Total number of outliers are: 0
Total percentage of outlier is: 0.0 %


Promo2SinceWeek :
q1:0.0, q2:0.0, q3:22.0
Lower bound: -33.0, Upper bound: 55.0, IQR: 22.0
Total number of outliers are: 0
Total percentage of outlier is: 0.0 %


Promo2SinceYear :
q1:0.0, q2:0.0, q3:2012.0
Lower bound: -3018.0, Upper bound: 5030.0, IQR: 2012.0
Total number of outliers are: 0
Total percentage of outlier is: 0.0 %


time: 5.64 s (started: 2024-09-07 18:44:59 +00:00)

3. Categorical Encoding¶

In [94]:
# checking the features having dtype as object
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 844338 entries, 0 to 1017190
Data columns (total 17 columns):
 #   Column                     Non-Null Count   Dtype         
---  ------                     --------------   -----         
 0   Store                      844338 non-null  int64         
 1   DayOfWeek                  844338 non-null  int64         
 2   Date                       844338 non-null  datetime64[ns]
 3   Sales                      844338 non-null  int64         
 4   Customers                  844338 non-null  int64         
 5   Promo                      844338 non-null  int64         
 6   StateHoliday               844338 non-null  object        
 7   SchoolHoliday              844338 non-null  int64         
 8   StoreType                  844338 non-null  object        
 9   Assortment                 844338 non-null  object        
 10  CompetitionDistance        844338 non-null  int64         
 11  CompetitionOpenSinceMonth  844338 non-null  int64         
 12  CompetitionOpenSinceYear   844338 non-null  int64         
 13  Promo2                     844338 non-null  int64         
 14  Promo2SinceWeek            844338 non-null  int64         
 15  Promo2SinceYear            844338 non-null  int64         
 16  PromoInterval              844338 non-null  object        
dtypes: datetime64[ns](1), int64(12), object(4)
memory usage: 116.0+ MB
time: 203 ms (started: 2024-09-07 18:45:24 +00:00)

Clearly "StoreType", "Assortment" and "PromoInterval" have dtype "object". To feed them into our machine learning algorithm, we need an encoding technique that converts these columns to an integer dtype.

In [95]:
# creating the variable that contains list of "object" dtypes
obj= ["StateHoliday", "StoreType", "Assortment", "PromoInterval"]
time: 0 ns (started: 2024-09-07 18:45:31 +00:00)
In [96]:
# checking the unique counts of the object-dtype columns, which is essential to determine the type of encoding to use for each column
for unique in obj:
  print(f"{unique}: ")
  print(f"The unique values are: {df[unique].unique()}")
  print(f"Total number of unique values are: {df[unique].nunique()}")
  print("\n")
StateHoliday: 
The unique values are: ['0' 'a' 'b' 'c' 0]
Total number of unique values are: 5


StoreType: 
The unique values are: ['c' 'a' 'd' 'b']
Total number of unique values are: 4


Assortment: 
The unique values are: ['a' 'c' 'b']
Total number of unique values are: 3


PromoInterval: 
The unique values are: [0 'Jan,Apr,Jul,Oct' 'Feb,May,Aug,Nov' 'Mar,Jun,Sept,Dec']
Total number of unique values are: 4


time: 219 ms (started: 2024-09-07 18:45:34 +00:00)

From the above output we can see that the feature "StateHoliday" contains both the string "0" and the integer 0 in different observations. So let's convert "0" (str) to 0 (int).

In [97]:
# replacing "0" to 0 and a=b=c=1 for our simplicity as they resembles that there is holiday
df["StateHoliday"].replace({"0":0, "a":1, "b":1, "c":1}, inplace=True)
time: 329 ms (started: 2024-09-07 18:46:13 +00:00)
In [98]:
# rechecking the unique counts of the object-dtype columns after the replacement
for unique in obj:
  print(f"{unique}: ")
  print(f"The unique values are: {df[unique].unique()}")
  print(f"Total number of unique values are: {df[unique].nunique()}")
  print("\n")
StateHoliday: 
The unique values are: [0 1]
Total number of unique values are: 2


StoreType: 
The unique values are: ['c' 'a' 'd' 'b']
Total number of unique values are: 4


Assortment: 
The unique values are: ['a' 'c' 'b']
Total number of unique values are: 3


PromoInterval: 
The unique values are: [0 'Jan,Apr,Jul,Oct' 'Feb,May,Aug,Nov' 'Mar,Jun,Sept,Dec']
Total number of unique values are: 4


time: 235 ms (started: 2024-09-07 18:46:16 +00:00)

There are several encoding techniques, but the major ones are:

  1. Ordinal encoding: used when the features are ordinal in nature and have a rank order between them.
  2. Nominal encoding: used when the categories carry equal weightage and have no inherent order.

As all our categorical columns are nominal in nature (they have no rank or order), we will use One-Hot Encoding (a type of nominal encoding) in our scenario:

In [99]:
# Let's create a copy of the dataframe to avoid blunders with our original dataframe
df_new=df.copy() 
time: 625 ms (started: 2024-09-07 18:46:25 +00:00)
In [ ]:
# Encode your categorical columns by dropping the first category
# df_new= pd.get_dummies(df, dtype=int)

We could use the code above to create binary dummy variables for each feature using ONE-HOT ENCODING, but we would then run into "multicollinearity", also known as the "dummy variable trap": the information in one dummy column can be fully explained by the others, which results in a high VIF. So it is better to drop one redundant category per feature right here.

We can do this easily by passing the argument "drop_first=True" to get_dummies instead of dropping the columns manually.
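On a toy column (values illustrative) the effect of drop_first is easy to see: without it we get one dummy per category, one of which is redundant; with it the first category is dropped and becomes the implicit baseline:

```python
import pandas as pd

toy = pd.DataFrame({"StoreType": ["a", "b", "c", "a"]})
full = pd.get_dummies(toy, dtype=int)                      # StoreType_a, _b, _c
reduced = pd.get_dummies(toy, dtype=int, drop_first=True)  # StoreType_b, _c only
print(list(reduced.columns))  # ['StoreType_b', 'StoreType_c']
```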

In [100]:
# Encode your categorical columns by dropping the first category
df_new= pd.get_dummies(df, dtype=int, drop_first=True)
time: 578 ms (started: 2024-09-07 18:46:32 +00:00)
In [101]:
# code to see all the features
pd.set_option('display.max_columns', None)
time: 0 ns (started: 2024-09-07 18:46:36 +00:00)
In [102]:
# Let's see first five observations of our dataset
df_new.head()
Out[102]:
Store DayOfWeek Date Sales Customers Promo StateHoliday SchoolHoliday CompetitionDistance CompetitionOpenSinceMonth CompetitionOpenSinceYear Promo2 Promo2SinceWeek Promo2SinceYear StoreType_b StoreType_c StoreType_d Assortment_b Assortment_c PromoInterval_Feb,May,Aug,Nov PromoInterval_Jan,Apr,Jul,Oct PromoInterval_Mar,Jun,Sept,Dec
0 1 5 2015-07-31 5263 555 1 0 1 1270 9 2008 0 0 0 0 1 0 0 0 0 0 0
1 2 5 2015-07-31 6064 625 1 0 1 570 11 2007 1 13 2010 0 0 0 0 0 0 1 0
2 3 5 2015-07-31 8314 821 1 0 1 14130 12 2006 1 14 2011 0 0 0 0 0 0 1 0
3 4 5 2015-07-31 8360 893 1 0 1 620 9 2009 0 0 0 0 1 0 0 1 0 0 0
4 5 5 2015-07-31 4822 559 1 0 1 6880 4 2015 0 0 0 0 0 0 0 0 0 0 0
time: 203 ms (started: 2024-09-07 18:46:39 +00:00)
In [103]:
# Verifying the dtype
df_new.info()
<class 'pandas.core.frame.DataFrame'>
Index: 844338 entries, 0 to 1017190
Data columns (total 22 columns):
 #   Column                          Non-Null Count   Dtype         
---  ------                          --------------   -----         
 0   Store                           844338 non-null  int64         
 1   DayOfWeek                       844338 non-null  int64         
 2   Date                            844338 non-null  datetime64[ns]
 3   Sales                           844338 non-null  int64         
 4   Customers                       844338 non-null  int64         
 5   Promo                           844338 non-null  int64         
 6   StateHoliday                    844338 non-null  int64         
 7   SchoolHoliday                   844338 non-null  int64         
 8   CompetitionDistance             844338 non-null  int64         
 9   CompetitionOpenSinceMonth       844338 non-null  int64         
 10  CompetitionOpenSinceYear        844338 non-null  int64         
 11  Promo2                          844338 non-null  int64         
 12  Promo2SinceWeek                 844338 non-null  int64         
 13  Promo2SinceYear                 844338 non-null  int64         
 14  StoreType_b                     844338 non-null  int64         
 15  StoreType_c                     844338 non-null  int64         
 16  StoreType_d                     844338 non-null  int64         
 17  Assortment_b                    844338 non-null  int64         
 18  Assortment_c                    844338 non-null  int64         
 19  PromoInterval_Feb,May,Aug,Nov   844338 non-null  int64         
 20  PromoInterval_Jan,Apr,Jul,Oct   844338 non-null  int64         
 21  PromoInterval_Mar,Jun,Sept,Dec  844338 non-null  int64         
dtypes: datetime64[ns](1), int64(21)
memory usage: 148.2 MB
time: 47 ms (started: 2024-09-07 18:46:51 +00:00)

What all categorical encoding techniques have you used & why did you use those techniques?¶

a. We used the one-hot encoding technique to convert our object-type categorical features into integer dummy variables so that they are compatible with the various ML algorithms we will use later.

b. Since every categorical feature has only 3 to 4 unordered categories (a small number), nominal encoding is a better choice here than ordinal encoding.

4. Feature Manipulation & Selection¶

1. Feature Manipulation¶

In [104]:
# Manipulate features to minimize feature correlation and create new ones
# Checking the first five observations of the dataset we have to deal with.
df_new.head()
Out[104]:
Store DayOfWeek Date Sales Customers Promo StateHoliday SchoolHoliday CompetitionDistance CompetitionOpenSinceMonth CompetitionOpenSinceYear Promo2 Promo2SinceWeek Promo2SinceYear StoreType_b StoreType_c StoreType_d Assortment_b Assortment_c PromoInterval_Feb,May,Aug,Nov PromoInterval_Jan,Apr,Jul,Oct PromoInterval_Mar,Jun,Sept,Dec
0 1 5 2015-07-31 5263 555 1 0 1 1270 9 2008 0 0 0 0 1 0 0 0 0 0 0
1 2 5 2015-07-31 6064 625 1 0 1 570 11 2007 1 13 2010 0 0 0 0 0 0 1 0
2 3 5 2015-07-31 8314 821 1 0 1 14130 12 2006 1 14 2011 0 0 0 0 0 0 1 0
3 4 5 2015-07-31 8360 893 1 0 1 620 9 2009 0 0 0 0 1 0 0 1 0 0 0
4 5 5 2015-07-31 4822 559 1 0 1 6880 4 2015 0 0 0 0 0 0 0 0 0 0 0
time: 16 ms (started: 2024-09-07 18:46:57 +00:00)
In [107]:
# Extracting date, month and year from Date feature
df_new["Day"]= df_new["Date"].dt.day
df_new["Month"]= df_new["Date"].dt.month
df_new["Year"]= df_new["Date"].dt.year
df_new["Week"]= df_new["Date"].dt.weekday
time: 594 ms (started: 2024-09-07 18:47:46 +00:00)
In [108]:
#checking first 5 observations
df_new.head()
Out[108]:
Store DayOfWeek Date Sales Customers Promo StateHoliday SchoolHoliday CompetitionDistance CompetitionOpenSinceMonth CompetitionOpenSinceYear Promo2 Promo2SinceWeek Promo2SinceYear StoreType_b StoreType_c StoreType_d Assortment_b Assortment_c PromoInterval_Feb,May,Aug,Nov PromoInterval_Jan,Apr,Jul,Oct PromoInterval_Mar,Jun,Sept,Dec Day Month Year Week
0 1 5 2015-07-31 5263 555 1 0 1 1270 9 2008 0 0 0 0 1 0 0 0 0 0 0 31 7 2015 4
1 2 5 2015-07-31 6064 625 1 0 1 570 11 2007 1 13 2010 0 0 0 0 0 0 1 0 31 7 2015 4
2 3 5 2015-07-31 8314 821 1 0 1 14130 12 2006 1 14 2011 0 0 0 0 0 0 1 0 31 7 2015 4
3 4 5 2015-07-31 8360 893 1 0 1 620 9 2009 0 0 0 0 1 0 0 1 0 0 0 31 7 2015 4
4 5 5 2015-07-31 4822 559 1 0 1 6880 4 2015 0 0 0 0 0 0 0 0 0 0 0 31 7 2015 4
time: 16 ms (started: 2024-09-07 18:47:54 +00:00)
In [109]:
# deriving 2 new features (final values are in months):
# 1. "CompetitionDuration -> duration for which the two stores have been competing"
# 2. "PromoDuration -> duration for which the store has been involved in Promo2"
df_new["CompetitionDuration"]= (df_new["Year"]-df_new["CompetitionOpenSinceYear"])*12 + (df_new["Month"]-df_new["CompetitionOpenSinceMonth"])
df_new["CompetitionDuration"] = df_new["CompetitionDuration"].clip(lower=0).fillna(0).astype(int)

# Creating the PromoDuration feature (approximating weeks as months via 7/30.5)
df_new["PromoDuration"]= (df_new["Year"]-df_new["Promo2SinceYear"])*12 + (df_new["Week"]-df_new["Promo2SinceWeek"])*7/30.5
df_new["PromoDuration"] = df_new["PromoDuration"].clip(lower=0).fillna(0).astype(int) * df_new["Promo2"]
time: 703 ms (started: 2024-09-07 18:49:07 +00:00)
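The month arithmetic can be sanity-checked by hand against store 1, whose competitor opened in September 2008 and whose observation date is July 2015:

```python
year, month = 2015, 7            # observation date of the first row
comp_year, comp_month = 2008, 9  # CompetitionOpenSince{Year, Month} for store 1
duration = (year - comp_year) * 12 + (month - comp_month)
print(duration)  # 82 months, matching the CompetitionDuration shown for store 1
```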
In [114]:
df_new.describe().columns.tolist()
Out[114]:
['Store',
 'DayOfWeek',
 'Date',
 'Sales',
 'Customers',
 'Promo',
 'StateHoliday',
 'SchoolHoliday',
 'CompetitionDistance',
 'CompetitionOpenSinceMonth',
 'CompetitionOpenSinceYear',
 'Promo2',
 'Promo2SinceWeek',
 'Promo2SinceYear',
 'StoreType_b',
 'StoreType_c',
 'StoreType_d',
 'Assortment_b',
 'Assortment_c',
 'PromoInterval_Feb,May,Aug,Nov',
 'PromoInterval_Jan,Apr,Jul,Oct',
 'PromoInterval_Mar,Jun,Sept,Dec',
 'Day',
 'Month',
 'Year',
 'Week',
 'CompetitionDuration',
 'PromoDuration']
time: 1.36 s (started: 2024-09-07 18:53:12 +00:00)
In [110]:
# checking first 5 observations after feature extraction
df_new.head()
Out[110]:
Store DayOfWeek Date Sales Customers Promo StateHoliday SchoolHoliday CompetitionDistance CompetitionOpenSinceMonth CompetitionOpenSinceYear Promo2 Promo2SinceWeek Promo2SinceYear StoreType_b StoreType_c StoreType_d Assortment_b Assortment_c PromoInterval_Feb,May,Aug,Nov PromoInterval_Jan,Apr,Jul,Oct PromoInterval_Mar,Jun,Sept,Dec Day Month Year Week CompetitionDuration PromoDuration
0 1 5 2015-07-31 5263 555 1 0 1 1270 9 2008 0 0 0 0 1 0 0 0 0 0 0 31 7 2015 4 82 0
1 2 5 2015-07-31 6064 625 1 0 1 570 11 2007 1 13 2010 0 0 0 0 0 0 1 0 31 7 2015 4 92 57
2 3 5 2015-07-31 8314 821 1 0 1 14130 12 2006 1 14 2011 0 0 0 0 0 0 1 0 31 7 2015 4 103 45
3 4 5 2015-07-31 8360 893 1 0 1 620 9 2009 0 0 0 0 1 0 0 1 0 0 0 31 7 2015 4 70 0
4 5 5 2015-07-31 4822 559 1 0 1 6880 4 2015 0 0 0 0 0 0 0 0 0 0 0 31 7 2015 4 3 0
time: 16 ms (started: 2024-09-07 18:49:20 +00:00)

2. Feature Selection¶

In [220]:
# Number of columns for the subplots (4 in each row)
n_cols = 4

numeric_cols = [col for col in df_new.describe().columns.tolist() if pd.api.types.is_numeric_dtype(df_new[col])]
n_rows = int(np.ceil(len(numeric_cols) / n_cols))

fig, axs = plt.subplots(n_rows, n_cols, figsize=(20, 15))
axs = axs.flatten()  # Flatten to easily index into 1D array

for idx, col in enumerate(numeric_cols):
    ax = axs[idx]
    feature = df_new[col]
    label = df_new['Sales']

    correlation = feature.corr(label)

    sns.scatterplot(x=feature, y=label, color="gray", ax=ax)
    ax.set_xlabel(col)
    ax.set_ylabel('Sales')
    ax.set_title(f'Sales vs {col} - corr: {correlation:.2f}')
    
    # Fit and plot regression line
    z = np.polyfit(feature, label, 1)
    y_hat = np.poly1d(z)(feature)
    ax.plot(feature, y_hat, "r--", lw=1)

# Remove empty subplots if there are any
for i in range(len(numeric_cols), n_rows * n_cols):
    fig.delaxes(axs[i])

# Adjust layout
plt.tight_layout()
plt.show()
[Figure: scatter plots of Sales vs each numeric feature with fitted regression lines]
time: 1min 24s (started: 2024-09-08 09:40:03 +00:00)
In [112]:
# Select your features wisely to avoid overfitting
plt.figure(figsize=(20,15))
sns.heatmap(abs(round(df_new.corr(),3)), annot=True, cmap=plt.cm.CMRmap)
Out[112]:
<Axes: >
[Figure: correlation heatmap of all features]
time: 3.91 s (started: 2024-09-07 18:50:49 +00:00)

Let's include in our final dataframe only those features that have a strong impact on the dependent variable, i.e. Sales. For this we are using the Variance Inflation Factor (VIF) technique to determine multicollinearity.
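The VIF of column j is 1/(1 - R²_j), where R²_j comes from regressing column j on the remaining predictors. A numpy-only sketch of the idea (the vif helper and the toy data are illustrative, not the calc_vif used below):

```python
import numpy as np

def vif(X, j):
    """VIF of column j of X: 1 / (1 - R^2) from regressing it on the other columns."""
    y = X[:, j]
    A = np.column_stack([np.ones(len(X)), np.delete(X, j, axis=1)])  # intercept + others
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r2 = 1.0 - resid.var() / y.var()
    return 1.0 / (1.0 - r2)

rng = np.random.default_rng(0)
x1, x2 = rng.normal(size=500), rng.normal(size=500)
X_ok = np.column_stack([x1, x2, rng.normal(size=500)])                    # independent columns
X_bad = np.column_stack([x1, x2, x1 + x2 + 0.05 * rng.normal(size=500)])  # near-collinear column
print(round(vif(X_ok, 2), 2), round(vif(X_bad, 2), 1))  # close to 1 vs very large
```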

In [118]:
import pandas as pd
import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

def calc_vif(X):
    # Ensure all columns are numeric
    X = X.apply(pd.to_numeric, errors='coerce')
    
    # Drop rows with NaN values
    X = X.dropna()
    
    # Add a constant to the DataFrame for the intercept term
    X = add_constant(X)
    
    # Calculate VIF
    vif = pd.DataFrame()
    vif["variables"] = X.columns
    vif["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
    
    return vif
time: 0 ns (started: 2024-09-07 19:14:42 +00:00)

Calculating the VIF (Variance Inflation Factor) by excluding:

  1. "Sales" -> As it is target variable
  2. "Store" -> As it is StoreID and not giving any information

We have extracted new features "CompetitionDuration" and "PromoDuration" from features "Promo2SinceWeek", "Promo2SinceYear", "CompetitionOpenSinceMonth", "CompetitionOpenSinceYear", "Month", "Year", "Week". So we can exclude them while calculating VIF.

In [119]:
# Calculate VIF by excluding certain columns
features = df_new[[i for i in df_new.describe().columns if i not in ["Store", "Sales", "Promo2SinceWeek", "Promo2SinceYear", "CompetitionOpenSinceMonth", "CompetitionOpenSinceYear", "Month", "Year", "Week"]]]
vif_df = calc_vif(features)

print(vif_df)
                         variables       VIF
0                            const  0.000000
1                        DayOfWeek  0.998467
2                             Date  1.124435
3                        Customers  1.042521
4                            Promo  1.000328
5                     StateHoliday  1.000024
6                    SchoolHoliday  1.000016
7              CompetitionDistance  0.999714
8                           Promo2  1.001310
9                      StoreType_b  1.009855
10                     StoreType_c  1.012996
11                     StoreType_d  1.089001
12                    Assortment_b  1.004681
13                    Assortment_c  1.038251
14   PromoInterval_Feb,May,Aug,Nov  0.999987
15   PromoInterval_Jan,Apr,Jul,Oct  0.999781
16  PromoInterval_Mar,Jun,Sept,Dec  1.002139
17                             Day  1.001386
18             CompetitionDuration  1.005574
19                   PromoDuration  1.004631
time: 38.3 s (started: 2024-09-07 19:14:45 +00:00)

Since, "Promo2","PromoInterval_Jan,Apr,Jul,Oct", "PromoInterval_Feb,May,Aug,Nov", "PromoInterval_Mar,Jun,Sept,Dec" are having high VIF values and "PromoInterval_Jan,Apr,Jul,Oct" is having least corelation with "Sales". So, let's exclude "PromoInterval_Jan,Apr,Jul,Oct"

In [120]:
# calculating the vif by excluding some features
calc_vif(df_new[[i for i in df_new.describe().columns if i not in ["Store", "Sales", "Promo2SinceWeek", "Promo2SinceYear", "CompetitionOpenSinceMonth", "CompetitionOpenSinceYear", "Month", "Year","Week", "PromoInterval_Jan,Apr,Jul,Oct"]]])
Out[120]:
variables VIF
0 const 0.000000
1 DayOfWeek 0.998467
2 Date 1.124435
3 Customers 1.042521
4 Promo 1.000328
5 StateHoliday 1.000024
6 SchoolHoliday 1.000016
7 CompetitionDistance 0.999714
8 Promo2 1.001310
9 StoreType_b 1.009855
10 StoreType_c 1.012996
11 StoreType_d 1.089001
12 Assortment_b 1.004681
13 Assortment_c 1.038251
14 PromoInterval_Feb,May,Aug,Nov 0.999987
15 PromoInterval_Mar,Jun,Sept,Dec 1.002139
16 Day 1.001386
17 CompetitionDuration 1.005574
18 PromoDuration 1.004631
time: 30 s (started: 2024-09-07 19:16:15 +00:00)

Since "Assortment_b" and "Store_type_b" are also showing high corelation i.e 0.72 so we are gaining same information from both the features. Its better to exclude the feature "Assortment_b" which is having less corelation with Sales(0.047).

In [121]:
# calculating the vif by excluding some features
calc_vif(df_new[[i for i in df_new.describe().columns if i not in ["Store", "Sales", "Promo2SinceWeek", "Promo2SinceYear", "CompetitionOpenSinceMonth", "CompetitionOpenSinceYear", "Month", "Year","Week", "PromoInterval_Jan,Apr,Jul,Oct","Assortment_b"]]])
Out[121]:
variables VIF
0 const 0.000000
1 DayOfWeek 0.998467
2 Date 1.124430
3 Customers 1.042521
4 Promo 1.000328
5 StateHoliday 1.000024
6 SchoolHoliday 1.000016
7 CompetitionDistance 0.999714
8 Promo2 1.001310
9 StoreType_b 1.009855
10 StoreType_c 1.012996
11 StoreType_d 1.089001
12 Assortment_c 1.038251
13 PromoInterval_Feb,May,Aug,Nov 0.999987
14 PromoInterval_Mar,Jun,Sept,Dec 1.002139
15 Day 1.001386
16 CompetitionDuration 1.005574
17 PromoDuration 1.004631
time: 26.2 s (started: 2024-09-07 19:17:15 +00:00)

Since "Promo2" and "PromoDuration" are highly correlated having correlation 0.69 and "PromoDuration" is least correlated with "Sales". So, let's exclude "PromoDuration" from VIF.

In [122]:
# calculating the vif by excluding some features
calc_vif(df_new[[i for i in df_new.describe().columns if i not in ["Store", "Sales", "Promo2SinceWeek", "Promo2SinceYear", "CompetitionOpenSinceMonth", "CompetitionOpenSinceYear", "Month","Week", "Year", "PromoInterval_Jan,Apr,Jul,Oct","Assortment_b","PromoDuration"]]])
Out[122]:
variables VIF
0 const 0.000000
1 DayOfWeek 0.998467
2 Date 1.038805
3 Customers 1.042521
4 Promo 1.000328
5 StateHoliday 1.000024
6 SchoolHoliday 1.000016
7 CompetitionDistance 0.999714
8 Promo2 1.001310
9 StoreType_b 1.009855
10 StoreType_c 1.012996
11 StoreType_d 1.089001
12 Assortment_c 1.038251
13 PromoInterval_Feb,May,Aug,Nov 0.999987
14 PromoInterval_Mar,Jun,Sept,Dec 1.002139
15 Day 1.001386
16 CompetitionDuration 1.005574
time: 21.8 s (started: 2024-09-07 19:17:42 +00:00)

Great! All the VIFs look very good (less than 10). Now let's move forward and store the selected features in a new dataframe.

In [123]:
# New Dataframe which will be our final dataframe
final_df = df_new[["DayOfWeek", "Customers","Promo","StateHoliday","SchoolHoliday","CompetitionDistance","Promo2","StoreType_b","StoreType_c","StoreType_d","Assortment_c","PromoInterval_Feb,May,Aug,Nov","PromoInterval_Mar,Jun,Sept,Dec","Day","CompetitionDuration","Sales"]]
time: 31 ms (started: 2024-09-07 19:18:04 +00:00)
In [124]:
# Checking info of our final dataframe
final_df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 844338 entries, 0 to 1017190
Data columns (total 16 columns):
 #   Column                          Non-Null Count   Dtype
---  ------                          --------------   -----
 0   DayOfWeek                       844338 non-null  int64
 1   Customers                       844338 non-null  int64
 2   Promo                           844338 non-null  int64
 3   StateHoliday                    844338 non-null  int64
 4   SchoolHoliday                   844338 non-null  int64
 5   CompetitionDistance             844338 non-null  int64
 6   Promo2                          844338 non-null  int64
 7   StoreType_b                     844338 non-null  int64
 8   StoreType_c                     844338 non-null  int64
 9   StoreType_d                     844338 non-null  int64
 10  Assortment_c                    844338 non-null  int64
 11  PromoInterval_Feb,May,Aug,Nov   844338 non-null  int64
 12  PromoInterval_Mar,Jun,Sept,Dec  844338 non-null  int64
 13  Day                             844338 non-null  int32
 14  CompetitionDuration             844338 non-null  int64
 15  Sales                           844338 non-null  int64
dtypes: int32(1), int64(15)
memory usage: 106.3 MB
time: 578 ms (started: 2024-09-07 19:18:04 +00:00)
What all feature selection methods have you used and why?¶

We have plotted seaborn's scatterplot and heatmap to see the relationship of each feature with the target variable, and observed the following correlations:

  1. Positive Correlation - Customers, Open, Promo, SchoolHoliday, CompetitionOpenSinceYear, Assortment_b, Assortment_c, Month, Year, Week, PromoDuration.
  2. Negative Correlation - DayOfWeek, StateHoliday, CompetitionDistance, CompetitionOpenSinceMonth, Promo2, Promo2SinceWeek, Promo2SinceYear, PromoInterval_Feb,May,Aug,Nov, PromoInterval_Jan,Apr,Jul,Oct, PromoInterval_Mar,Jun,Sept,Dec, Day, CompetitionDuration.
  3. No Correlation - Store, StoreType_c, StoreType_d

We have used the Filter method with a correlation heatmap and the variance inflation factor (VIF), and excluded some of the features that were adding noise to the model. We have also removed multicollinearity and selected features that are well related to our target variable.
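As an illustration of the filter method, the sketch below ranks synthetic features (not the actual Rossmann columns) by their absolute Pearson correlation with a toy target and keeps only those above a purely illustrative threshold:

```python
import numpy as np

# Toy illustration of the filter method: rank features by |Pearson r|
# with the target and drop the weakest. All data here is synthetic.
rng = np.random.default_rng(42)
n = 1000
customers = rng.normal(500, 100, n)      # hypothetical strong driver
promo = rng.integers(0, 2, n)            # hypothetical weaker driver
noise = rng.normal(size=n)               # pure noise feature
sales = 8.0 * customers + 500.0 * promo + rng.normal(0, 100, n)

features = {"Customers": customers, "Promo": promo, "Noise": noise}
corrs = {name: abs(np.corrcoef(col, sales)[0, 1]) for name, col in features.items()}
selected = [name for name, r in corrs.items() if r > 0.1]   # illustrative cut-off

print(sorted(corrs.items(), key=lambda kv: -kv[1]))
print(selected)
```

The noise column falls below the threshold and is filtered out, while the real drivers survive — the same logic the heatmap-based selection applies visually.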

Which all features you found important and why?¶

We have selected "DayOfWeek","Customers","Promo","StateHoliday","SchoolHoliday","CompetitionDistance","Promo2","StoreType_b","StoreType_c","StoreType_d","Assortment_c","PromoInterval_Feb,May,Aug,Nov","PromoInterval_Mar,Jun,Sept,Dec","Day","CompetitionDuration" as our final features, as they are well correlated with the target variable (Sales) and no two of them provide the same information.

5. Data Transformation¶

Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?¶

In [125]:
# checking which of the variables are continuous in nature
for i in final_df.columns:
  print(f"The number of unique counts in feature {i} is: {final_df[i].nunique()}")
The number of unique counts in feature DayOfWeek is: 7
The number of unique counts in feature Customers is: 1390
The number of unique counts in feature Promo is: 2
The number of unique counts in feature StateHoliday is: 2
The number of unique counts in feature SchoolHoliday is: 2
The number of unique counts in feature CompetitionDistance is: 555
The number of unique counts in feature Promo2 is: 2
The number of unique counts in feature StoreType_b is: 2
The number of unique counts in feature StoreType_c is: 2
The number of unique counts in feature StoreType_d is: 2
The number of unique counts in feature Assortment_c is: 2
The number of unique counts in feature PromoInterval_Feb,May,Aug,Nov is: 2
The number of unique counts in feature PromoInterval_Mar,Jun,Sept,Dec is: 2
The number of unique counts in feature Day is: 31
The number of unique counts in feature CompetitionDuration is: 173
The number of unique counts in feature Sales is: 12755
time: 406 ms (started: 2024-09-07 19:21:23 +00:00)
In [126]:
# Storing the continuous variables (number of unique counts > 40) in a separate list so we can transform them toward a Gaussian distribution
cont_variables= ["Sales", "Customers", "CompetitionDistance", "CompetitionDuration"]
time: 735 ms (started: 2024-09-07 19:21:35 +00:00)
In [127]:
# Checking the distribution of each continuous variable from our final dataframe
plt.figure(figsize=(20,5))
print("Before Applying Transformation")
for n,col in enumerate(cont_variables):
  plt.subplot(1,5,n+1)
  sns.distplot(final_df[col])
  plt.title(f'Distribution of {col}')
  plt.tight_layout()
Before Applying Transformation
No description has been provided for this image
time: 30.7 s (started: 2024-09-07 19:21:42 +00:00)

It is clear from the distribution plots above that these features are not normally distributed. So we will apply some transformation techniques to obtain approximately normally distributed data, as normality is one of the underlying assumptions for linear models such as Linear Regression.

In [128]:
import matplotlib.pyplot as plt
import scipy.stats as stat
import pylab 
#### If you want to check whether a feature is gaussian / normally distributed
#### Q-Q plot
def plot_data(df,feature):
    stat.probplot(df[feature],dist='norm',plot=pylab)
time: 656 ms (started: 2024-09-07 19:25:31 +00:00)
In [129]:
plt.figure(figsize=(20,5))
for num,column in enumerate(cont_variables):
  plt.subplot(1,5,num+1)
  print(f"Q-Q Plot for variable: {column}")
  plot_data(final_df,column)
plt.show()
Q-Q Plot for variable: Sales
Q-Q Plot for variable: Customers
Q-Q Plot for variable: CompetitionDistance
Q-Q Plot for variable: CompetitionDuration
No description has been provided for this image
time: 10.1 s (started: 2024-09-07 19:25:45 +00:00)

As the plots above show, our continuous numeric features do not follow a perfectly normal distribution. So we will now try various transformation techniques to get a Gaussian-shaped curve.

Let's create different copies and check which transformation is best for each feature
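Each transformation tried below compresses a right tail to a different degree. The self-contained sketch here compares sample skewness before and after each candidate transformation on synthetic lognormal data (not the actual Rossmann columns) — skewness near 0 means closer to symmetric, Gaussian-like data:

```python
import numpy as np

def skew(x):
    # Sample skewness: E[(x - mu)^3] / sigma^3; 0 for a symmetric distribution.
    x = np.asarray(x, dtype=float)
    return ((x - x.mean()) ** 3).mean() / x.std() ** 3

rng = np.random.default_rng(1)
x = rng.lognormal(mean=8, sigma=0.8, size=10_000)   # right-skewed, like Sales

candidates = {
    "raw":         x,
    "log":         np.log(x + 1),
    "sqrt":        np.sqrt(x),
    "reciprocal":  1 / (x + 1),
    "power 1/1.2": x ** (1 / 1.2),
}
for name, t in candidates.items():
    print(f"{name:12s} skew = {skew(t):+.2f}")
```

On this synthetic example the log transform nearly symmetrizes the data and the square root cuts the skew substantially — which is why comparing the Q-Q plots per feature, as done below, is the right way to pick a transformation.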

In [130]:
# Creating 5 different copies to check the distribution of each of the variable
test_df1=final_df.copy()
test_df2=final_df.copy()
test_df3=final_df.copy()
test_df4=final_df.copy()
test_df5=final_df.copy()
time: 688 ms (started: 2024-09-07 19:26:22 +00:00)

Logarithmic Transformation

In [131]:
# Applying transformation on the above considered columns
test_df1['Sales']=np.log(test_df1['Sales']+1)
test_df1['Customers']=np.log(test_df1['Customers']+1)
test_df1['CompetitionDistance']=np.log(test_df1['CompetitionDistance']+1)
test_df1['CompetitionDuration']=np.log(test_df1['CompetitionDuration']+1)

# Checking the distribution of each continuous variable after the transformation
plt.figure(figsize=(20,5))
print("After Applying Transformation")
for n,col in enumerate(cont_variables):
  plt.subplot(1,5,n+1)
  sns.distplot(test_df1[col])
  plt.title(f'Distribution of {col}')
  plt.tight_layout()
After Applying Transformation
No description has been provided for this image
time: 22.6 s (started: 2024-09-07 19:26:53 +00:00)
In [132]:
plt.figure(figsize=(20,5))
for num,column in enumerate(cont_variables):
  plt.subplot(1,5,num+1)
  print(f"Q-Q Plot for variable: {column}")
  plot_data(test_df1,column)
plt.show()
Q-Q Plot for variable: Sales
Q-Q Plot for variable: Customers
Q-Q Plot for variable: CompetitionDistance
Q-Q Plot for variable: CompetitionDuration
No description has been provided for this image
time: 9.88 s (started: 2024-09-07 19:27:22 +00:00)

Reciprocal Transformation

In [133]:
# Applying transformation on the above considered columns
test_df2['Sales']=1/(test_df2['Sales']+1)
test_df2['Customers']=1/(test_df2['Customers']+1)
test_df2['CompetitionDistance']=1/(test_df2['CompetitionDistance']+1)
test_df2['CompetitionDuration']=1/(test_df2['CompetitionDuration']+1)


# Checking the distribution of each continuous variable after the transformation
plt.figure(figsize=(20,5))
print("After Applying Transformation")
for n,col in enumerate(cont_variables):
  plt.subplot(1,5,n+1)
  sns.distplot(test_df2[col])
  plt.title(f'Distribution of {col}')
  plt.tight_layout()
After Applying Transformation
No description has been provided for this image
time: 18.5 s (started: 2024-09-07 19:28:00 +00:00)
In [134]:
plt.figure(figsize=(20,5))
for num,column in enumerate(cont_variables):
  plt.subplot(1,5,num+1)
  print(f"Q-Q Plot for variable: {column}")
  plot_data(test_df2,column)
plt.show()
Q-Q Plot for variable: Sales
Q-Q Plot for variable: Customers
Q-Q Plot for variable: CompetitionDistance
Q-Q Plot for variable: CompetitionDuration
No description has been provided for this image
time: 11.6 s (started: 2024-09-07 19:28:38 +00:00)

Square Root Transformation

In [135]:
# Applying transformation on the above considered columns
test_df3['Sales']=(test_df3['Sales'])**(1/2)
test_df3['Customers']=(test_df3['Customers'])**(1/2)
test_df3['CompetitionDistance']=(test_df3['CompetitionDistance'])**(1/2)
test_df3['CompetitionDuration']=(test_df3['CompetitionDuration'])**(1/2)

# Checking the distribution of each continuous variable after the transformation
plt.figure(figsize=(20,5))
print("After Applying Transformation")
for n,col in enumerate(cont_variables):
  plt.subplot(1,5,n+1)
  sns.distplot(test_df3[col])
  plt.title(f'Distribution of {col}')
  plt.tight_layout()
After Applying Transformation
No description has been provided for this image
time: 28.9 s (started: 2024-09-07 19:29:23 +00:00)
In [136]:
plt.figure(figsize=(20,5))
for num,column in enumerate(cont_variables):
  plt.subplot(1,5,num+1)
  print(f"Q-Q Plot for variable: {column}")
  plot_data(test_df3,column)
plt.show()
Q-Q Plot for variable: Sales
Q-Q Plot for variable: Customers
Q-Q Plot for variable: CompetitionDistance
Q-Q Plot for variable: CompetitionDuration
No description has been provided for this image
time: 56.6 s (started: 2024-09-07 19:29:56 +00:00)

Exponential Transformation

In [137]:
# Applying transformation on the above considered columns
test_df4['Sales']=(test_df4['Sales'])**(1/1.2)
test_df4['Customers']=(test_df4['Customers'])**(1/1.2)
test_df4['CompetitionDistance']=(test_df4['CompetitionDistance'])**(1/1.2)
test_df4['CompetitionDuration']=(test_df4['CompetitionDuration'])**(1/1.2)

# Checking the distribution of each continuous variable after the transformation
plt.figure(figsize=(20,5))
print("After Applying Transformation")
for n,col in enumerate(cont_variables):
  plt.subplot(1,5,n+1)
  sns.distplot(test_df4[col])
  plt.title(f'Distribution of {col}')
  plt.tight_layout()
After Applying Transformation
No description has been provided for this image
time: 30.8 s (started: 2024-09-07 19:30:53 +00:00)
In [194]:
plt.figure(figsize=(20,5))
for num,column in enumerate(cont_variables):
  plt.subplot(1,5,num+1)
  print(f"Q-Q Plot for variable: {column}")
  plot_data(test_df4,column)
plt.show()
Q-Q Plot for variable: Sales
Q-Q Plot for variable: Customers
Q-Q Plot for variable: CompetitionDistance
Q-Q Plot for variable: CompetitionDuration
No description has been provided for this image
time: 26.3 s (started: 2024-09-08 08:59:08 +00:00)

From the above plots it is clear that:

  1. Feature "Sales" needs square root transformation
  2. Feature "Customers" needs square root transformation
  3. Feature "CompetitionDistance" needs square root transformation
  4. Feature "CompetitionDuration" needs no transformation
In [139]:
# Applying transformation on the above considered columns
## Square root transformation
final_df['Sales']=(final_df['Sales'])**(1/2)
final_df['Customers']=(final_df['Customers'])**(1/2)
final_df['CompetitionDistance']=(final_df['CompetitionDistance'])**(1/2)
time: 2.33 s (started: 2024-09-07 19:32:04 +00:00)

Now we have approximately normally distributed data for all of the continuous variables.
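One thing worth keeping in mind: because "Sales" is now on the square-root scale, model predictions will also be on that scale, and must be squared to report sales in the original units. A minimal sketch (with made-up sales figures):

```python
import numpy as np

sales = np.array([5263.0, 6064.0, 8314.0])   # hypothetical values on the original scale
sales_sqrt = np.sqrt(sales)                  # the scale the model is trained on
sales_back = sales_sqrt ** 2                 # invert the transform after predicting
print(sales_back)
```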

6. Data Scaling¶

Before scaling our data, let's separate our "x" and "y" variables, as we do not need to scale the target variable "y".

In [140]:
# Separating "x" and "y" variables
x= final_df[["DayOfWeek", "Customers","Promo","StateHoliday","SchoolHoliday","CompetitionDistance","Promo2","StoreType_b","StoreType_c","StoreType_d","Assortment_c","PromoInterval_Feb,May,Aug,Nov","PromoInterval_Mar,Jun,Sept,Dec","Day","CompetitionDuration"]]
y= final_df[['Sales']]
print(x.shape)
print(y.shape)
(844338, 15)
(844338, 1)
time: 1.03 s (started: 2024-09-07 19:32:12 +00:00)
In [141]:
# Scaling your data
# Importing StandardScaler library
from sklearn.preprocessing import StandardScaler
time: 93 ms (started: 2024-09-07 19:32:17 +00:00)
In [142]:
# Creating object
std_regressor= StandardScaler()
time: 16.1 s (started: 2024-09-07 19:32:21 +00:00)
In [143]:
# Fit and Transform
x= std_regressor.fit_transform(x)
time: 812 ms (started: 2024-09-07 19:32:46 +00:00)
Which method have you used to scale your data and why?¶

We have used sklearn's StandardScaler to scale our data. This is important because features on very different scales can lead to poor performance or slow convergence. Standardizing the features also makes it easier to compare features and to read off the effect of a feature on the target variable ("Sales") from the magnitude of its coefficient. Additionally, we are going to apply a linear regression model, for which roughly normally distributed, comparably scaled inputs work best, and standardization helps with that.
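Under the hood, StandardScaler computes z = (x − μ) / σ column by column. An equivalent NumPy sketch on toy data (note that StandardScaler uses the population standard deviation, i.e. ddof=0):

```python
import numpy as np

# Two toy columns on very different scales.
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])

mu = X.mean(axis=0)
sigma = X.std(axis=0)            # population std (ddof=0), as in StandardScaler
X_scaled = (X - mu) / sigma

# After scaling, every column has mean 0 and unit variance.
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))
```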

7. Dimensionality Reduction¶

Do you think that dimensionality reduction is needed? Explain Why?¶
In [ ]:
# Dimensionality Reduction (If needed)
## Not needed here — after VIF-based selection we are left with only 15 features, so techniques like PCA would cost interpretability for little gain
Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)¶
In [ ]:
# Dimensionality Reduction (If needed)
## Not needed for now

8. Data Splitting¶

In [144]:
# Split your data to train and test. Choose Splitting ratio wisely.
# Importing train_test_split from sklearn
from sklearn.model_selection import train_test_split
time: 0 ns (started: 2024-09-07 19:33:38 +00:00)
In [145]:
# Spliting the dataset
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
time: 953 ms (started: 2024-09-07 19:33:41 +00:00)
In [146]:
# Checking the shape after spliting
x_train.shape, x_test.shape, y_train.shape, y_test.shape
Out[146]:
((675470, 15), (168868, 15), (675470, 1), (168868, 1))
time: 187 ms (started: 2024-09-07 19:33:44 +00:00)
What data splitting ratio have you used and why?¶

Since our dataset is huge, with nearly one million (10 lakh) observations, we have assigned 80% of the data to the train set and 20% to the test set, with random_state=0 so that we get the same split every time.
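Conceptually, a seeded 80/20 split is just a reproducible shuffle of row indices. A NumPy sketch (train_test_split itself rounds the test size up, which is why the actual split above has 168,868 test rows rather than the truncated 168,867 shown here):

```python
import numpy as np

n = 844_338                           # number of rows in our final dataframe
rng = np.random.default_rng(0)        # fixed seed -> the same split every run
idx = rng.permutation(n)              # shuffle all row indices once

n_test = int(n * 0.2)                 # 20% of rows go to the test set
test_idx, train_idx = idx[:n_test], idx[n_test:]

print(len(train_idx), len(test_idx))
```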

9. ML Model Implementation¶

In [147]:
# Importing essential libraries to check the accuracy
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
from sklearn.metrics import mean_absolute_percentage_error
time: 0 ns (started: 2024-09-07 19:34:26 +00:00)
In [148]:
# Defining the function that calculates regression metrics
def regression_metrics(y_train_actual,y_train_pred,y_test_actual,y_test_pred):
  print("-"*50)
  ## mean_absolute_error
  MAE_train= mean_absolute_error(y_train_actual,y_train_pred)
  print("MAE on train is:" ,MAE_train)
  MAE_test= mean_absolute_error(y_test_actual,y_test_pred)
  print("MAE on test is:" ,MAE_test)

  print("-"*50)

  ## mean_squared_error
  MSE_train= mean_squared_error(y_train_actual, y_train_pred)
  print("MSE on train is:" ,MSE_train)
  MSE_test  = mean_squared_error(y_test_actual, y_test_pred)
  print("MSE on test is:" ,MSE_test)

  print("-"*50)

  ## root_mean_squared_error
  RMSE_train = np.sqrt(MSE_train)
  print("RMSE on train is:" ,RMSE_train)
  RMSE_test = np.sqrt(MSE_test)
  print("RMSE on test is:" ,RMSE_test)

  print("-"*50)

  ## mean_absolute_percentage_error
  MAPE_train = mean_absolute_percentage_error(y_train_actual, y_train_pred)*100
  print("MAPE on train is:" ,MAPE_train, " %")
  MAPE_test = mean_absolute_percentage_error(y_test_actual, y_test_pred)*100
  print("MAPE on test is:" ,MAPE_test, " %")

  print("-"*50)

  ## r2_score
  R2_train= r2_score(y_train_actual,y_train_pred)
  print("R2 on train is:" ,R2_train)
  R2_test= r2_score(y_test_actual,y_test_pred)
  print("R2 on test is:" ,R2_test)

  print("-"*50)

  ## accuracy defined as 100 - MAPE
  Accuracy_train= 100- MAPE_train
  print("Accuracy of train is:" ,Accuracy_train, " %")
  Accuracy_test= 100- MAPE_test
  print("Accuracy of test is:" ,Accuracy_test, " %")

  print("-"*50)
time: 94 ms (started: 2024-09-07 19:34:48 +00:00)

ML Model - 1 - Linear Regression¶

In [149]:
# Importing LinearRegression from sklearn
from sklearn.linear_model import LinearRegression
time: 0 ns (started: 2024-09-07 19:34:53 +00:00)
In [150]:
# ML Model - 1 Implementation
linear_regressor= LinearRegression()

# Fit the Algorithm
linear_regressor.fit(x_train,y_train)

# Predict the model
y_train_regression_pred= linear_regressor.predict(x_train)
y_test_regression_pred= linear_regressor.predict(x_test)
time: 2.7 s (started: 2024-09-07 19:34:57 +00:00)
In [151]:
# Checking the coefficients
linear_regressor.coef_
Out[151]:
array([[ 0.01897995, 12.57582977,  2.78598063, -0.04033691,  0.10933249,
         1.65153343,  1.18626577,  0.06662023, -0.30609499,  2.78863733,
         0.77648252, -0.25561243, -0.47094053,  0.01665519,  0.24632187]])
time: 78 ms (started: 2024-09-07 19:35:04 +00:00)
In [152]:
# Checking the intercept
linear_regressor.intercept_
Out[152]:
array([80.18759328])
time: 78 ms (started: 2024-09-07 19:35:08 +00:00)

1. Explain the ML Model used and its performance using the Evaluation Metric Score Chart.¶

In [153]:
# Calculating the regression metrics
regression_metrics(y_train,y_train_regression_pred,y_test,y_test_regression_pred)
--------------------------------------------------
MAE on train is: 5.399363906213304
MAE on test is: 5.397135045548209
--------------------------------------------------
MSE on train is: 51.838213377061166
MSE on test is: 51.85624827426843
--------------------------------------------------
RMSE on train is: 7.199875927893561
RMSE on test is: 7.201128263978391
--------------------------------------------------
MAPE on train is: 6.756242165207454  %
MAPE on test is: 6.759115324427814  %
--------------------------------------------------
R2 on train is: 0.7642276479767802
R2 on test is: 0.7648823059686072
--------------------------------------------------
Accuracy of train is: 93.24375783479255  %
Accuracy of test is: 93.24088467557219  %
--------------------------------------------------
time: 203 ms (started: 2024-09-07 19:35:27 +00:00)
In [155]:
import numpy as np
import matplotlib.pyplot as plt

# Calculating residuals
residuals = y_test - y_test_regression_pred
Mean = round(np.mean(residuals), 2)
Median = round(np.median(residuals), 2)

# Plot residuals
plt.figure(figsize=(15, 8))
plt.scatter(y_test, residuals, c="dodgerblue")
plt.title("Residual Plot for Metric Evaluation")
plt.xlabel('Predicted Sales')
plt.ylabel('Residual Error')

# Add horizontal line at mean and median values of residuals
plt.axhline(y=Mean, color='red', linestyle='--', label=f'Mean: {Mean}')
plt.axhline(y=Median, color='green', linestyle='--', label=f'Median: {Median}')
plt.legend()

plt.show()
No description has been provided for this image
time: 2.08 s (started: 2024-09-07 19:37:26 +00:00)

We started with the most basic and simple ML model, Linear Regression, and evaluated the most important regression metrics on both the train and test datasets. For Linear Regression, we can observe that the two r2 scores are very close, which indicates that the model generalizes to the test set and is not overfitting.

We can infer that the dependent and independent variables do not follow a strongly linear relationship, which is why we got a maximum r2 score of only about 0.76 from the Linear Regression implementation.

Also, we are getting Mean = 0.03 and Median = -0.33 in our residual plot, which suggests the residuals are close to a normal distribution centred at zero (Mean = Median = 0); because of this, our accuracy (100 - MAPE) is already quite good for a simple linear model.

In order to fetch more accurate results, we shall go for cross-validation and hyperparameter tuning of the Lasso, Ridge and Elastic Net models.

2. Cross- Validation & Hyperparameter Tuning¶

Ridge Regression

In [156]:
# import ridge regression from sklearn library
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Creating Ridge instance
ridge= Ridge()

# Defining parameters
parameters = {"alpha": [1e-1,1,5,7,10,11,14,15,16,17], "max_iter":[1,2,3]}

# Train the model
ridgeR = GridSearchCV(ridge, parameters, scoring='neg_mean_squared_error', cv=3)
ridgeR.fit(x_train,y_train)

# Predict the output
y_train_ridge_pred = ridgeR.predict(x_train)
y_test_ridge_pred = ridgeR.predict(x_test)

# Printing the best parameters obtained by GridSearchCV
print(f"The best alpha value found out to be: {ridgeR.best_params_}")
print(f"Negative mean square error is: {ridgeR.best_score_}")
The best alpha value found out to be: {'alpha': 5, 'max_iter': 1}
Negative mean square error is: -51.84161630198643
time: 18 s (started: 2024-09-07 19:37:41 +00:00)
In [157]:
# Calculating regression metrics for Ridge
regression_metrics(y_train,y_train_ridge_pred,y_test,y_test_ridge_pred)
--------------------------------------------------
MAE on train is: 5.39936831746395
MAE on test is: 5.397139365030187
--------------------------------------------------
MSE on train is: 51.83821339009539
MSE on test is: 51.85624931397073
--------------------------------------------------
RMSE on train is: 7.199875928798731
RMSE on test is: 7.201128336168626
--------------------------------------------------
MAPE on train is: 6.75625264802869  %
MAPE on test is: 6.759125779863134  %
--------------------------------------------------
R2 on train is: 0.7642276479174975
R2 on test is: 0.7648823012545676
--------------------------------------------------
Accuracy of train is: 93.24374735197131  %
Accuracy of test is: 93.24087422013686  %
--------------------------------------------------
time: 125 ms (started: 2024-09-07 19:37:59 +00:00)

Lasso Regression

In [158]:
# import lasso regression from sklearn library
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

# Creating Lasso instance
lasso= Lasso()

# Defining parameters
parameters_lasso = {"alpha": [1e-5,1e-4,1e-3,1e-2,1e-1,1,5], "max_iter":[7,8,9,10]}

# Train the model
lassoR = GridSearchCV(lasso, parameters_lasso, scoring='neg_mean_squared_error', cv=5)
lassoR.fit(x_train,y_train)

# Predict the output
y_train_lasso_pred = lassoR.predict(x_train)
y_test_lasso_pred = lassoR.predict(x_test)

# Printing the best parameters obtained by GridSearchCV
print(f"The best alpha value found out to be: {lassoR.best_params_}")
print(f"Negative mean square error is: {lassoR.best_score_}")
The best alpha value found out to be: {'alpha': 1e-05, 'max_iter': 8}
Negative mean square error is: -51.84164296103468
time: 1min 28s (started: 2024-09-07 19:37:59 +00:00)
In [159]:
# Calculating regression metrics for Lasso
regression_metrics(y_train,y_train_lasso_pred,y_test,y_test_lasso_pred)
--------------------------------------------------
MAE on train is: 5.399370056056112
MAE on test is: 5.397141169955692
--------------------------------------------------
MSE on train is: 51.83821341225924
MSE on test is: 51.856248514207515
--------------------------------------------------
RMSE on train is: 7.199875930337914
RMSE on test is: 7.201128280638216
--------------------------------------------------
MAPE on train is: 6.756255374815047  %
MAPE on test is: 6.7591285720161025  %
--------------------------------------------------
R2 on train is: 0.7642276478166912
R2 on test is: 0.7648823048807165
--------------------------------------------------
Accuracy of train is: 93.24374462518496  %
Accuracy of test is: 93.2408714279839  %
--------------------------------------------------
time: 94 ms (started: 2024-09-07 19:39:28 +00:00)

Elastic Net Regression

In [160]:
# import elastic net regression from sklearn library
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV

# Creating e_net instance
e_net= ElasticNet()

# Defining hyperparameters
parameters_e_net = {"alpha": [1e-5,1e-4,1e-3,1e-2,1,5], "max_iter":[12,13,14,15]}

# Train the model
e_netR = GridSearchCV(e_net, parameters_e_net, scoring='neg_mean_squared_error', cv=5)
e_netR.fit(x_train,y_train)

# Predict the output
y_train_e_net_pred = e_netR.predict(x_train)
y_test_e_net_pred = e_netR.predict(x_test)

# Printing the best parameters obtained by GridSearchCV
print(f"The best alpha value found out to be: {e_netR.best_params_}")
print(f"Negative mean square error is: {e_netR.best_score_}")
The best alpha value found out to be: {'alpha': 1e-05, 'max_iter': 14}
Negative mean square error is: -51.84164296969463
time: 1min 20s (started: 2024-09-07 19:39:28 +00:00)
In [161]:
# Calculating regression metrics for Elastic Net
regression_metrics(y_train,y_train_e_net_pred,y_test,y_test_e_net_pred)
--------------------------------------------------
MAE on train is: 5.39936767217231
MAE on test is: 5.3971388124548145
--------------------------------------------------
MSE on train is: 51.83821338523007
MSE on test is: 51.85624922791782
--------------------------------------------------
RMSE on train is: 7.199875928460855
RMSE on test is: 7.201128330193666
--------------------------------------------------
MAPE on train is: 6.756250488878987  %
MAPE on test is: 6.759123701524289  %
--------------------------------------------------
R2 on train is: 0.7642276479396262
R2 on test is: 0.7648823016447339
--------------------------------------------------
Accuracy of train is: 93.24374951112101  %
Accuracy of test is: 93.24087629847571  %
--------------------------------------------------
time: 94 ms (started: 2024-09-07 19:40:49 +00:00)
Which hyperparameter optimization technique have you used and why?¶

We have used GridSearchCV as the hyperparameter optimization technique because it exhaustively evaluates all possible combinations of the given hyperparameter values, measures the performance of each combination via cross-validation, and selects the best one. This makes it the most thorough (though also the most expensive) tuning method.
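Concretely, GridSearchCV enumerates the Cartesian product of the parameter lists — for the Ridge grid above, 10 alpha values × 3 max_iter values = 30 candidate models, each refit once per CV fold. A stdlib sketch of that enumeration:

```python
from itertools import product

# The same grid passed to GridSearchCV for Ridge above.
param_grid = {"alpha": [1e-1, 1, 5, 7, 10, 11, 14, 15, 16, 17],
              "max_iter": [1, 2, 3]}

# Expand the grid into one dict per candidate, as GridSearchCV does internally.
keys = list(param_grid)
candidates = [dict(zip(keys, combo)) for combo in product(*param_grid.values())]

print(len(candidates))   # 10 alphas x 3 max_iter values = 30 candidates
```

With cv=3 this means 30 × 3 = 90 model fits just for Ridge, which is why grid search gets expensive quickly as grids grow.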

Have you seen any improvement? Note down the improvement with the updated Evaluation Metric Score Chart.¶

Despite using the Lasso, Ridge and Elastic Net models, we could not see any significant improvement in the r2 score, MSE or MAPE. This prompted us to move on to more complex ML models such as Decision Trees, Random Forest, XGBoost and LightGBM.

ML Model - 2 - Decision Trees¶

Decision Tree

In [162]:
# import the regressor
from sklearn.tree import DecisionTreeRegressor 
  
# create a regressor object
TreeR = DecisionTreeRegressor(max_depth=16) 
  
# fit the regressor with X and Y data
TreeR.fit(x_train, y_train)

# predict the model
y_train_tree_pred= TreeR.predict(x_train)
y_test_tree_pred= TreeR.predict(x_test)
time: 5.31 s (started: 2024-09-07 19:40:49 +00:00)
In [163]:
# Calculating Regression Metrics
regression_metrics(y_train,y_train_tree_pred,y_test,y_test_tree_pred)
--------------------------------------------------
MAE on train is: 2.7697952685956766
MAE on test is: 3.2202328979979327
--------------------------------------------------
MSE on train is: 15.900512861851604
MSE on test is: 20.537404316883553
--------------------------------------------------
RMSE on train is: 3.9875447159689137
RMSE on test is: 4.531821302399682
--------------------------------------------------
MAPE on train is: 3.413559329973217  %
MAPE on test is: 3.9883099056435807  %
--------------------------------------------------
R2 on train is: 0.9276807383667831
R2 on test is: 0.9068828288765332
--------------------------------------------------
Accuracy of train is: 96.58644067002679  %
Accuracy of test is: 96.01169009435642  %
--------------------------------------------------
time: 78 ms (started: 2024-09-07 19:40:54 +00:00)
In [164]:
# Importing libraries for visualizing decison tree
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn import tree
from IPython.display import SVG
from graphviz import Source
from IPython.display import display
time: 313 ms (started: 2024-09-07 19:40:54 +00:00)
In [167]:
# # # Visualizing the Decision tree
# plt.figure(figsize=(5,50))
# graph = Source(tree.export_graphviz(TreeR, out_file=None, feature_names=final_df.columns[:-1], class_names=['0', '1'] , filled = True))
# display(SVG(graph.pipe(format='svg')))
time: 0 ns (started: 2024-09-07 19:42:28 +00:00)

1. Explain the ML Model used and its performance using the Evaluation Metric Score Chart.¶

In [185]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Assuming y_test is a pandas Series or DataFrame, convert to numpy array
y_test_array = np.array(y_test).flatten()  # Flatten if necessary
y_test_tree_pred = y_test_tree_pred.flatten()  # Flatten y_test_tree_pred if needed

# Ensure the lengths of y_test_array and y_test_tree_pred match
if y_test_array.shape != y_test_tree_pred.shape:
    raise ValueError(f"Shape mismatch: y_test has shape {y_test_array.shape}, but y_test_tree_pred has shape {y_test_tree_pred.shape}")

# Calculating residuals
residuals_DT = y_test_array - y_test_tree_pred

# Compute Mean and Median of residuals
Mean = round(np.mean(residuals_DT), 2)
Median = round(np.median(residuals_DT), 2)

# Plot residuals
plt.figure(figsize=(15, 8))
plt.scatter(y_test_array, residuals_DT, c="lightgreen")
plt.title("Residual Plot for Metric Evaluation")
plt.xlabel('Predicted Sales')
plt.ylabel('Residual Error')

# Add horizontal lines for Mean and Median values
plt.axhline(y=Mean, color='red', linestyle='--', label=f'Mean: {Mean}')
plt.axhline(y=Median, color='green', linestyle='--', label=f'Median: {Median}')
plt.legend()

plt.show()
No description has been provided for this image
time: 5.2 s (started: 2024-09-08 05:17:45 +00:00)

After the simple linear models, we tried a Decision Tree and saw a good increase in the r2 score, from about 0.76 to 0.91, meaning that roughly 91% of the variance of our test dataset is captured by the trained model, which is excellent. At the same time, the test RMSE dropped below 5 (to about 4.53), and accuracy increased from about 93% to 96%. From the residual plot, the mean and median of the residuals are also shifting towards 0, which means our model is improving. But, in the quest for more accurate and realistic predictions, we decided to tune the hyperparameters further and check the results.

2. Cross- Validation & Hyperparameter Tuning¶

Decision Tree with GridSearchCV

In [174]:
# import decision tree regressor from sklearn library
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import GridSearchCV

# Creating DecisionTreeRegressor instance
decision_tree= DecisionTreeRegressor()

# Defining parameters
parameters= {'max_depth': [16,17,18], 'min_samples_leaf': [6,7,8], 'min_samples_split': [1,2,4]}

# Train the model
decision_treeR = GridSearchCV(decision_tree, parameters, scoring='neg_mean_squared_error', cv=3)
decision_treeR.fit(x_train,y_train)

# Predict the output
y_train_grid_Dtree_pred = decision_treeR.predict(x_train)
y_test_grid_Dtree_pred = decision_treeR.predict(x_test)

# Printing the best parameters obtained by GridSearchCV
print(f"The best parameters found out to be: {decision_treeR.best_params_}")
print(f"Negative mean square error is: {decision_treeR.best_score_}")
The best parameters found out to be: {'max_depth': 18, 'min_samples_leaf': 6, 'min_samples_split': 2}
Negative mean square error is: -19.170815911551106
time: 2min 35s (started: 2024-09-07 19:46:33 +00:00)
In [175]:
# Calculating Regression Metrics
regression_metrics(y_train,y_train_grid_Dtree_pred,y_test,y_test_grid_Dtree_pred)
--------------------------------------------------
MAE on train is: 2.4415311928566963
MAE on test is: 2.924883984798052
--------------------------------------------------
MSE on train is: 12.711055348083626
MSE on test is: 17.710768793369432
--------------------------------------------------
RMSE on train is: 3.5652567015691345
RMSE on test is: 4.208416423474445
--------------------------------------------------
MAPE on train is: 3.010730513757244  %
MAPE on test is: 3.6175633740203494  %
--------------------------------------------------
R2 on train is: 0.9421871391609117
R2 on test is: 0.9196988741608125
--------------------------------------------------
Accuracy of train is: 96.98926948624276  %
Accuracy of test is: 96.38243662597965  %
--------------------------------------------------
time: 78 ms (started: 2024-09-07 19:49:08 +00:00)
Which hyperparameter optimization technique have you used and why?¶

We used GridSearchCV as the hyperparameter optimization technique because it exhaustively evaluates every possible combination of the supplied hyperparameters with cross-validation, computes the performance of each combination, and selects the best-scoring one. Since no combination is skipped, it is the most thorough (though also the most expensive) tuning method.
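As a rough sketch of what the search does under the hood, the grid above expands to the Cartesian product of the parameter lists (illustrative only; real code should simply use sklearn's GridSearchCV, which additionally cross-validates every candidate):

```python
# Enumerate the full Cartesian product of a parameter grid, the way
# GridSearchCV does internally (sketch, not the sklearn internals).
from itertools import product

parameters = {'max_depth': [16, 17, 18],
              'min_samples_leaf': [6, 7, 8],
              'min_samples_split': [2, 3, 4]}

keys = list(parameters)
candidates = [dict(zip(keys, combo)) for combo in product(*parameters.values())]

print(len(candidates))  # 27 candidates; with cv=3 that means 81 model fits
```

With three values per parameter this is cheap, but the count grows multiplicatively with every extra parameter, which is why the later, heavier models switch to RandomizedSearchCV.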

Have you seen any improvement? Note down the improvement with the updated Evaluation metric Score Chart.¶

We tried different combinations of parameters to get the best R2 score and the least MAPE for our case. The best combination was found to be {'max_depth': 18, 'min_samples_leaf': 6, 'min_samples_split': 2}, which improved the MSE on the test set from about 22 to about 18. MAPE also reduced from roughly 4% to 3.6%, and the model now captures about 1% more variance, i.e. 92% of the test dataset. At this point we have achieved above 96% accuracy (96.38%) by hyperparameter tuning of the Decision Tree.

3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.¶

In order to minimise the error between actual and predicted values, we evaluate our ML model using different metrics; each gives an indication of how close we are to the real/expected output. In our case, every evaluation metric shows little difference between the train and test data, which means the model generalises well and predicts values close to what is expected. So Sales, the dependent variable that drives the business, is predicted with roughly 96% accuracy, on average about 3-4% away from the actual absolute values.

ML Model - 3 - Random Forest Regression¶

Random Forest

In [176]:
# import the regressor
from sklearn.ensemble import RandomForestRegressor 
  
# create a regressor object
RF_TreeR = RandomForestRegressor(n_estimators=100, max_depth=18) 
  
# fit the regressor with X and Y data
RF_TreeR.fit(x_train, y_train)

# predict the model
y_train_RFtree_pred= RF_TreeR.predict(x_train)
y_test_RFtree_pred= RF_TreeR.predict(x_test)
time: 5min 47s (started: 2024-09-07 19:49:08 +00:00)
In [177]:
# Calculating Regression Metrics using RandomForestRegressor
regression_metrics(y_train,y_train_RFtree_pred,y_test,y_test_RFtree_pred)
--------------------------------------------------
MAE on train is: 2.0123358913754408
MAE on test is: 2.575020416997231
--------------------------------------------------
MSE on train is: 8.583103616299738
MSE on test is: 13.526901273445434
--------------------------------------------------
RMSE on train is: 2.929693433842479
RMSE on test is: 3.677893591914458
--------------------------------------------------
MAPE on train is: 2.4658230249571447  %
MAPE on test is: 3.17333314851201  %
--------------------------------------------------
R2 on train is: 0.9609620317630492
R2 on test is: 0.9386686476433553
--------------------------------------------------
Accuracy of train is: 97.53417697504285  %
Accuracy of test is: 96.82666685148799  %
--------------------------------------------------
time: 109 ms (started: 2024-09-07 19:54:56 +00:00)
In [ ]:
 

1. Explain the ML Model used and its performance using Evaluation metric Score Chart.¶

In [186]:
# Visualizing evaluation Metric Score chart
# Calculating residuals
y_test_RFtree_pred= y_test_RFtree_pred.reshape(-1,1)
residuals_RF = y_test - y_test_RFtree_pred
Mean= round(np.mean(residuals_RF),2)
Median= round(np.median(residuals_RF),2)

# Plot residuals
plt.figure(figsize=(15,8))
plt.scatter(y_test, residuals_RF, c="tomato")  # plot the Random Forest residuals
plt.title("Residual Plot for Metric Evaluation")
plt.xlabel('Actual Sales')
plt.ylabel('Residual Error')

# Add horizontal lines at the mean and median of the residuals
plt.axhline(y=np.nanmean(residuals_RF), color='red', linestyle='--', label=f'Mean: {Mean}')
plt.axhline(y=np.nanmedian(residuals_RF), color='green', linestyle='--', label=f'Median: {Median}')
plt.legend()

plt.show()
[Image: residual plot for the Random Forest model, with mean and median reference lines]
time: 5 s (started: 2024-09-08 05:21:25 +00:00)

With our third model, Random Forest, we achieved an R2 score of 0.96 on the training set and 0.94 on the test set, which is very good. The test MSE also reduced, from about 17.7 (tuned Decision Tree) to 13.5, which means the model is moving towards the optimum. Apart from this, MAPE is also reducing gradually: Random Forest is an ensemble of decision trees built with bootstrap aggregation (bagging), and each tree considers a random subset of the features at every split, which lowers the variance compared with a single tree.

We got Mean = 0.02 and Median = -0.09; as accuracy increases, the mean and median are shifting towards 0 and the residual error is tending towards a Gaussian distribution.

In order to get higher accuracy, let's perform hyperparameter tuning for the same model and see whether the results improve significantly.

In [ ]:
 

2. Cross-Validation & Hyperparameter Tuning¶

Random Forest with RandomizedSearchCV

In [187]:
# Import RandomForestRegressor and RandomizedSearchCV from the sklearn library
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Creating RandomForestRegressor instance
RF_tree= RandomForestRegressor()

# Defining parameters
parameters= {'n_estimators':[100], 'max_depth': [19,20], 'min_samples_leaf': [1, 2]}

# Train the model
RF_treeR = RandomizedSearchCV(RF_tree, parameters, n_iter=3, n_jobs=-1, scoring='neg_mean_squared_error', cv=3,  verbose=3)
RF_treeR.fit(x_train,y_train)

# Predict the output
y_train_grid_RFtree_pred = RF_treeR.predict(x_train)
y_test_grid_RFtree_pred = RF_treeR.predict(x_test)

# Printing the best parameters obtained by RandomizedSearchCV
print(f"The best parameters found out to be: {RF_treeR.best_params_}")
print(f"Negative mean square error is: {RF_treeR.best_score_}")
Fitting 3 folds for each of 3 candidates, totalling 9 fits
The best parameters found out to be: {'n_estimators': 100, 'min_samples_leaf': 1, 'max_depth': 20}
Negative mean square error is: -13.326300275095756
time: 3h 17min 39s (started: 2024-09-08 05:22:01 +00:00)
In [188]:
# Calculating Regression Metrics using RandomizedSearchCV in RandomForestRegressor
regression_metrics(y_train,y_train_grid_RFtree_pred,y_test,y_test_grid_RFtree_pred)
--------------------------------------------------
MAE on train is: 1.5896297097118546
MAE on test is: 2.417859502759893
--------------------------------------------------
MSE on train is: 5.733342792077885
MSE on test is: 12.328536157440835
--------------------------------------------------
RMSE on train is: 2.394439974624105
RMSE on test is: 3.5112015261788714
--------------------------------------------------
MAPE on train is: 1.9456006567742095  %
MAPE on test is: 2.9839811158367744  %
--------------------------------------------------
R2 on train is: 0.973923412344266
R2 on test is: 0.9441020689196584
--------------------------------------------------
Accuracy of train is: 98.0543993432258  %
Accuracy of test is: 97.01601888416323  %
--------------------------------------------------
time: 4.86 s (started: 2024-09-08 08:39:45 +00:00)
Which hyperparameter optimization technique have you used and why?¶

We used RandomizedSearchCV for the Random Forest since we have a huge dataset, and it suits large, complex models where we only want to evaluate a random subset of the possible parameter combinations. It reduces processing and training time by sampling random subsets of the provided parameters without significantly compromising the accuracy of the model.
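The idea can be sketched in a few lines (illustrative only; the real work is done by sklearn's RandomizedSearchCV): rather than scoring every combination, sample n_iter of them at random:

```python
# Sample a random subset of parameter combinations, the way
# RandomizedSearchCV with n_iter=3 does (sketch, not the sklearn internals).
import random
from itertools import product

parameters = {'n_estimators': [100], 'max_depth': [19, 20], 'min_samples_leaf': [1, 2]}
keys = list(parameters)
all_candidates = [dict(zip(keys, combo)) for combo in product(*parameters.values())]

random.seed(0)  # fixed seed so the sketch is reproducible
sampled = random.sample(all_candidates, k=3)  # 3 of the 4 possible combinations
```

Each sampled candidate would then be cross-validated exactly as in GridSearchCV; only the candidate list shrinks, which is where the time saving comes from.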

Have you seen any improvement? Note down the improvement with the updated Evaluation metric Score Chart.¶

After using RandomizedSearchCV with different hyperparameters we did not observe a dramatically significant improvement, although the MSE on the test dataset reduced from about 13.5 to 12.3.

ML Model - 4 - LightGBM Regression¶

LightGBM Regression

In [189]:
# import the regressor
from lightgbm import LGBMRegressor
  
# create a regressor object
lgbmR = LGBMRegressor(boosting_type='gbdt', max_depth=120, learning_rate=0.1, n_estimators=500,  n_jobs=-1) 
  
# fit the regressor with X and Y data
lgbmR.fit(x_train, y_train)

# predict the model
y_train_lgbmR_pred= lgbmR.predict(x_train)
y_test_lgbmrR_pred= lgbmR.predict(x_test)
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.145220 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 754
[LightGBM] [Info] Number of data points in the train set: 675470, number of used features: 15
[LightGBM] [Info] Start training from score 80.193247
time: 50.8 s (started: 2024-09-08 08:39:54 +00:00)
In [190]:
# Calculating Regression Metrics using LGBMRegressor
regression_metrics(y_train, y_train_lgbmR_pred, y_test, y_test_lgbmrR_pred)
--------------------------------------------------
MAE on train is: 2.91304873439038
MAE on test is: 2.9392378837783384
--------------------------------------------------
MSE on train is: 15.688779171172214
MSE on test is: 16.072587498890375
--------------------------------------------------
RMSE on train is: 3.960906357283925
RMSE on test is: 4.009063169730601
--------------------------------------------------
MAPE on train is: 3.5941393883473514  %
MAPE on test is: 3.624929481469822  %
--------------------------------------------------
R2 on train is: 0.9286437528497654
R2 on test is: 0.9271264344101798
--------------------------------------------------
Accuracy of train is: 96.40586061165264  %
Accuracy of test is: 96.37507051853018  %
--------------------------------------------------
time: 235 ms (started: 2024-09-08 08:40:45 +00:00)

1. Explain the ML Model used and its performance using Evaluation metric Score Chart.¶

In [191]:
# Visualizing evaluation Metric Score chart
# Calculating residuals
y_test_lgbmrR_pred= y_test_lgbmrR_pred.reshape(-1,1)
residuals_LGBM = y_test - y_test_lgbmrR_pred
Mean= round(np.mean(residuals_LGBM),2)
Median= round(np.median(residuals_LGBM),2)

# Plot residuals
plt.figure(figsize=(15,8))
plt.scatter(y_test, residuals_LGBM, c="gold")  # plot the LightGBM residuals
plt.title("Residual Plot for Metric Evaluation")
plt.xlabel('Actual Sales')
plt.ylabel('Residual Error')

# Add horizontal lines at the mean and median of the residuals
plt.axhline(y=np.nanmean(residuals_LGBM), color='red', linestyle='--', label=f'Mean: {Mean}')
plt.axhline(y=np.nanmedian(residuals_LGBM), color='green', linestyle='--', label=f'Median: {Median}')
plt.legend()

plt.show()
[Image: residual plot for the LightGBM model, with mean and median reference lines]
time: 9.38 s (started: 2024-09-08 08:40:46 +00:00)

LightGBM is a lighter, faster implementation of gradient boosting (GBM). It is often faster than, and comparably accurate to, other popular gradient-boosting libraries such as XGBoost on several datasets. We wanted to check whether our accuracy score could be improved further, so we tried LightGBM in order to achieve more accurate results.

We saw that with LightGBM we are able to capture about 93% of the variance of the dependent variable on the test dataset (R2 score of 0.93) and achieved about 96% accuracy. We are also getting a MAPE of about 3.6%, close to that of the Random Forest.

Plotting the residual plot gives Mean = 0.02 and Median = -0.12, which indicates that the residuals are close to a normal distribution.

We have further checked the performance metrics by hyperparameter tuning of LightGBM.

2. Cross-Validation & Hyperparameter Tuning¶

LightGBM with RandomizedSearchCV

In [192]:
# Import RandomizedSearchCV from the sklearn library
from sklearn.model_selection import RandomizedSearchCV

# Creating LightGBM instance
lgbm= LGBMRegressor()

# Defining parameters
parameters={"learning_rate":[0.01,0.1],"max_depth":[120,125,150],"n_estimators":[500,600]}

# Train the model
lgbm_rand_R= RandomizedSearchCV(lgbm,parameters,scoring='neg_mean_squared_error',n_jobs=-1,cv=3,verbose=3)
lgbm_rand_R.fit(x_train,y_train)

# Predict the output
y_train_rand_lgbm_pred = lgbm_rand_R.predict(x_train)
y_test_rand_lgbm_pred = lgbm_rand_R.predict(x_test)

# Printing the best parameters obtained by RandomizedSearchCV
print(f"The best parameters found out to be: {lgbm_rand_R.best_params_}")
print(f"Negative mean square error is: {lgbm_rand_R.best_score_}")
Fitting 3 folds for each of 10 candidates, totalling 30 fits
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.055629 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 754
[LightGBM] [Info] Number of data points in the train set: 675470, number of used features: 15
[LightGBM] [Info] Start training from score 80.193247
The best parameters found out to be: {'n_estimators': 600, 'max_depth': 120, 'learning_rate': 0.1}
Negative mean square error is: -15.373153727841583
time: 17min 52s (started: 2024-09-08 08:40:56 +00:00)
In [193]:
# Calculating Regression Metrics using RandomizedSearchCV in LGBMRegressor
regression_metrics(y_train,y_train_rand_lgbm_pred,y_test,y_test_rand_lgbm_pred)
--------------------------------------------------
MAE on train is: 2.808544010473359
MAE on test is: 2.8397578835618305
--------------------------------------------------
MSE on train is: 14.78884339696807
MSE on test is: 15.240751099617366
--------------------------------------------------
RMSE on train is: 3.845626528534469
RMSE on test is: 3.9039404580010397
--------------------------------------------------
MAPE on train is: 3.4681588342246794  %
MAPE on test is: 3.5040521668976616  %
--------------------------------------------------
R2 on train is: 0.9327368718122303
R2 on test is: 0.9308980041345075
--------------------------------------------------
Accuracy of train is: 96.53184116577532  %
Accuracy of test is: 96.49594783310233  %
--------------------------------------------------
time: 2.69 s (started: 2024-09-08 08:59:05 +00:00)
Which hyperparameter optimization technique have you used and why?¶

RandomizedSearchCV was again the better option, since it takes much less processing time without compromising accuracy, so we decided to use that hyperparameter optimization technique here as well.

Have you seen any improvement? Note down the improvement with the updated Evaluation metric Score Chart.¶

We tried different parameters for tuning our LightGBM model and achieved an R2 score of 0.93 on both the training and the test set, which means the model is well balanced and neither underfitting nor overfitting. Our MAPE also reduced slightly after hyperparameter optimization, from about 3.6% to 3.5%, which is a fair improvement. The best parameters obtained by the optimization are {'n_estimators': 600, 'max_depth': 120, 'learning_rate': 0.1}.

ML Model - 5 - XGBoost Regression¶

XGBoost Regression

In [195]:
# import the regressor
from xgboost import XGBRegressor
  
# create a regressor object
xgbR = XGBRegressor(learning_rate=0.2, max_depth=10) 
  
# fit the regressor with X and Y data
xgbR.fit(x_train, y_train)

# predict the model
y_train_xgbR_pred= xgbR.predict(x_train)
y_test_xgbR_pred= xgbR.predict(x_test)
time: 20.9 s (started: 2024-09-08 08:59:34 +00:00)
In [196]:
# Calculating Regression Metrics using XGBRegressor
regression_metrics(y_train,y_train_xgbR_pred,y_test,y_test_xgbR_pred)
--------------------------------------------------
MAE on train is: 2.300788371097172
MAE on test is: 2.488442144576197
--------------------------------------------------
MSE on train is: 10.564064476456862
MSE on test is: 12.626634994254763
--------------------------------------------------
RMSE on train is: 3.2502406797738628
RMSE on test is: 3.5533976690281603
--------------------------------------------------
MAPE on train is: 2.8503016609467497  %
MAPE on test is: 3.0738452646833228  %
--------------------------------------------------
R2 on train is: 0.9519521594047546
R2 on test is: 0.9427504539489746
--------------------------------------------------
Accuracy of train is: 97.14969833905324  %
Accuracy of test is: 96.92615473531667  %
--------------------------------------------------
time: 203 ms (started: 2024-09-08 08:59:55 +00:00)

1. Explain the ML Model used and its performance using Evaluation metric Score Chart.¶

In [197]:
# Visualizing evaluation Metric Score chart
# Calculating residuals
y_test_xgbR_pred= y_test_xgbR_pred.reshape(-1,1)
residuals_XG = y_test - y_test_xgbR_pred
Mean= round(np.mean(residuals_XG),2)
Median= round(np.median(residuals_XG),2)

# Plot residuals
plt.figure(figsize=(15,8))
plt.scatter(y_test, residuals_XG, c="mediumslateblue")  # plot the XGBoost residuals
plt.title("Residual Plot for Metric Evaluation")
plt.xlabel('Actual Sales')
plt.ylabel('Residual Error')

# Add horizontal lines at the mean and median of the residuals
plt.axhline(y=np.nanmean(residuals_XG), color='red', linestyle='--', label=f'Mean: {Mean}')
plt.axhline(y=np.nanmedian(residuals_XG), color='green', linestyle='--', label=f'Median: {Median}')
plt.legend()

plt.show()
[Image: residual plot for the XGBoost model, with mean and median reference lines]
time: 2.64 s (started: 2024-09-08 08:59:56 +00:00)

XGBoost (eXtreme Gradient Boosting) is a gradient-boosting algorithm that is very popular for achieving good accuracy. We used XGBoost in order to push the MAPE towards 0 (minimize the error).

We got Mean = 0.0 and Median = -0.09, both very close to 0, which suggests that the residuals are approximately normally distributed around zero.

We got a MAPE of about 3% and an R2 score of 0.94 on the test dataset, which is excellent and an improvement over the previous models. At this stage even a slight improvement in MAPE can translate into a large gain for the stakeholders, so we were curious and excited to improve the model's efficiency further, and decided to tune the various hyperparameters of XGBoost.

2. Cross-Validation & Hyperparameter Tuning¶

XGBoost with GridSearchCV

In [198]:
# XGBoost with GridSearchCV
from sklearn.model_selection import GridSearchCV

# Creating XGBoost instance
xgb= XGBRegressor()

# Defining parameters
parameters={"learning_rate":[0.01, 0.1],"max_depth":[12,13]}

# Train the model
xgb_Rand_R= GridSearchCV(xgb,parameters,scoring='neg_mean_squared_error',n_jobs=-1,cv=3,verbose=3)
xgb_Rand_R.fit(x_train,y_train)

# Predict the output
y_train_rand_xgbR_pred = xgb_Rand_R.predict(x_train)
y_test_rand_xgbR_pred = xgb_Rand_R.predict(x_test)

# Printing the best parameters obtained by GridSearchCV
print(f"The best parameters found out to be: {xgb_Rand_R.best_params_}")
print(f"Negative mean square error is: {xgb_Rand_R.best_score_}")
Fitting 3 folds for each of 4 candidates, totalling 12 fits
The best parameters found out to be: {'learning_rate': 0.1, 'max_depth': 13}
Negative mean square error is: -12.666820504218125
time: 3min 14s (started: 2024-09-08 08:59:59 +00:00)
In [199]:
# Calculating Regression Metrics using GridSearchCV in XGBRegressor
regression_metrics(y_train,y_train_rand_xgbR_pred,y_test,y_test_rand_xgbR_pred)
--------------------------------------------------
MAE on train is: 1.9971611315234616
MAE on test is: 2.408316048417558
--------------------------------------------------
MSE on train is: 8.185973358094907
MSE on test is: 12.039598781294993
--------------------------------------------------
RMSE on train is: 2.8611140064832976
RMSE on test is: 3.469812499443593
--------------------------------------------------
MAPE on train is: 2.4722991097816407  %
MAPE on test is: 2.9756028948509585  %
--------------------------------------------------
R2 on train is: 0.9627683162689209
R2 on test is: 0.9454120993614197
--------------------------------------------------
Accuracy of train is: 97.52770089021836  %
Accuracy of test is: 97.02439710514903  %
--------------------------------------------------
time: 4.61 s (started: 2024-09-08 09:03:14 +00:00)
Which hyperparameter optimization technique have you used and why?¶

XGBoost is a heavy algorithm and takes a lot of processing time with GridSearchCV, so tuning its hyperparameters this way was a slow and somewhat complicated task for us. We therefore used a small grid in GridSearchCV, because we didn't want to miss the best parameter combination within the ranges we cared about.

Have you seen any improvement? Note down the improvement with the updated Evaluation metric Score Chart.¶

Minor improvements in the regression metrics are significant now, as we are moving towards model perfection. With GridSearchCV we got an R2 score of 0.945 on the test dataset (our model now captures about 94.5% of the test-set variance), slightly higher than the untuned XGBoost, and the best parameters were found to be {'learning_rate': 0.1, 'max_depth': 13}. We also noticed that MAPE reduced further, falling below 3% (the minimum error among all models), while the test MSE reduced to about 12. Increasing max_depth beyond this made the model overfit, so the above values are the best combination.

1. Which Evaluation metrics did you consider for a positive business impact and why?¶

Since predicting sales over a period of time is essentially a time-series problem, the following regression metrics are the ones most relevant to our goal of analysis (predicting future Sales):

  1. MAE (Mean Absolute Error): This metric calculates the average magnitude of the errors in the predictions, without considering their direction; the lower the MAE, the more accurate the model. In regression analysis our aim is to minimise the MAE, which ultimately creates a positive business impact.

  2. RMSE (Root Mean Squared Error): The square root of MSE, and one of the most widely used regression metrics, since it has the same units as the original data, making the magnitude of the error easy to interpret. It also penalises large errors more heavily than MAE.

  3. R2 Score: The R2 score (coefficient of determination) is widely used in regression analysis because it measures the proportion of the variance in the dependent variable that is explained by the independent variables. It lets analysts quickly evaluate the goodness of fit of a model and compare different models, which aids decisions about model selection and further analysis.

  4. MAPE (Mean Absolute Percentage Error): Calculated by taking the average of the absolute percentage differences between the predicted values and the actual values. This metric is particularly useful when working with time-series data (as in our case), as it allows easy comparison of forecast accuracy across different scales, and an analyst can easily explain the percentage error to stakeholders. It is considered one of the most important regression metrics for a positive business impact on time-series data.

  5. Accuracy: For this kind of data (predicting Sales, Customers, stock prices, etc.) a convenient accuracy measure is 100 - MAPE. Since MAPE is the average of the absolute percentage differences between the predicted and actual values, a higher value of 100 - MAPE indicates a more accurate model.
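The regression_metrics helper used throughout this notebook is defined in an earlier section; a minimal sketch of what it computes (the function name and return shape here are assumptions, not the notebook's exact code) looks like this:

```python
# Minimal sketch of the metrics reported above: MAE, MSE, RMSE, MAPE,
# R2 and the notebook's "Accuracy" (100 - MAPE). Assumes no zero targets,
# since MAPE divides by the actual values.
from math import sqrt

def regression_metrics_sketch(y_true, y_pred):
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    mape = 100 * sum(abs(e / t) for e, t in zip(errors, y_true)) / n
    mean_y = sum(y_true) / n
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    r2 = 1 - (mse * n) / ss_tot  # 1 - SS_res / SS_tot
    return {"MAE": mae, "MSE": mse, "RMSE": sqrt(mse),
            "MAPE": mape, "R2": r2, "Accuracy": 100 - mape}
```

For example, actuals [100, 200, 400] against predictions [110, 190, 400] give MAE ≈ 6.67, MAPE = 5% and hence Accuracy = 95%.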

2. Which ML model did you choose from the above created models as your final prediction model and why?¶

In [ ]:
# Storing different regression metrics in order to make dataframe and compare them
models = ["Linear_regression","Decision_tree","Random_forest","LightGBM","XGboost"]
MAE_r = [5.38,2.94,2.43,2.83,2.37] 
MSE_r = [51.54,17.87,12.47,15.17,11.64]
RMSE_r = [7.17,4.22,3.53,3.89,3.41]
MAPE_r = [6.73,3.63,2.99,3.49,2.93]
r2_r = [0.76,0.91,0.94,0.93,0.94]
accuracy_r = [93.26,96.36,97.00,96.50,97.06]

# Create dataframe from the lists
data = {'Models': models, 
        'MAE': MAE_r,
        'MSE': MSE_r,
        'RMSE': RMSE_r,
        'MAPE': MAPE_r,
        'R2': r2_r,
        'Accuracy': accuracy_r
       }
metric_df = pd.DataFrame(data)

# Printing dataframe
metric_df

We have chosen XGBoost as our final prediction model, with hyperparameters {'learning_rate': 0.1, 'max_depth': 13}, as it is clear from the above dataframe that it gives the highest accuracy (97%), the least MAPE (≈3%) and the maximum R2 score (0.94) on the test dataset among all the models.
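The selection can also be stated mechanically from the comparison lists above (a sketch; the notebook makes the choice by reading the dataframe):

```python
# Pick the model with the highest test accuracy (equivalently, the lowest
# MAPE, since Accuracy = 100 - MAPE in this notebook). Values are copied
# from the comparison cell above.
models = ["Linear_regression", "Decision_tree", "Random_forest", "LightGBM", "XGboost"]
accuracy_r = [93.26, 96.36, 97.00, 96.50, 97.06]

best_model, best_acc = max(zip(models, accuracy_r), key=lambda pair: pair[1])
print(best_model, best_acc)  # XGboost 97.06
```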

3. Explain the model which you have used and the feature importance using any model explainability tool?¶

XGBoost (eXtreme Gradient Boosting) provides an efficient implementation of the gradient boosting framework. It supports both linear and tree-based models, and it is useful for large datasets. The basic idea behind XGBoost is to train a sequence of simple models, such as decision trees, and combine their predictions to create a more powerful model. Each tree is trained to correct the errors made by the previous trees in the sequence, and this is known as boosting.

XGBoost uses a technique called gradient boosting to optimize the parameters of the trees. It minimizes the loss function by adjusting the parameters of the trees in a way that reduces the error of the overall model. XGBoost also includes a number of other features, such as regularization, which helps to prevent overfitting, and parallel processing, which allows for faster training times.
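The fit-to-the-residuals loop at the heart of boosting can be sketched on toy data (pure Python, decision stumps instead of full trees; an illustration of the idea, not XGBoost's actual implementation, which adds regularization, second-order gradients and much more):

```python
# Gradient boosting for squared error, in miniature: start from the mean,
# then repeatedly fit a one-split "stump" to the current residuals and add
# a learning-rate-damped copy of it to the ensemble.

def fit_stump(x, residuals):
    """Return the one-threshold predictor that best fits the residuals."""
    best = None
    for t in sorted(set(x))[:-1]:  # candidate split points (right side never empty)
        left = [r for xi, r in zip(x, residuals) if xi <= t]
        right = [r for xi, r in zip(x, residuals) if xi > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda xi, t=t, lm=lm, rm=rm: lm if xi <= t else rm

def boost(x, y, n_rounds=20, lr=0.3):
    pred = [sum(y) / len(y)] * len(x)  # round 0: predict the mean
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(x, residuals)  # each stump corrects what is left
        pred = [pi + lr * stump(xi) for pi, xi in zip(pred, x)]
    return pred

x = [1, 2, 3, 4, 5, 6]
y = [10.0, 12.0, 11.0, 30.0, 33.0, 31.0]
pred = boost(x, y)
```

After 20 rounds the ensemble's predictions track the two clusters in y closely; XGBoost applies the same principle with gradients of an arbitrary loss and regularized trees.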

Although tree-based algorithms give the most accurate results, they have low explainability. With the help of explainability tools like LIME and SHAP we can explain our model to the stakeholders.

Model Explainability¶

We can approach model explainability by two methods:

  1. Globally - how the features in the data collectively affect the result, e.g. linear regression coefficients, feature importances.

  2. Locally - how the features affect an individual prediction, e.g. SHAP, LIME.

Global Explainability¶
In [202]:
# Plotting the barplot to determine which feature is contributing the most
features = final_df.columns
importances = xgbR.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(8,10))
# plt.grid(zorder=0)
plt.title('Feature Importances', fontsize=20)
plt.barh(range(len(indices)), importances[indices], align='center')
plt.yticks(range(len(indices)), features[indices])
plt.xlabel('Relative Importance')
Out[202]:
Text(0.5, 0, 'Relative Importance')
[Image: horizontal bar chart of feature importances in decreasing order]
time: 2.34 s (started: 2024-09-08 09:07:57 +00:00)
In [201]:
# Checking the percentage of feature importance
feature_imp = pd.DataFrame(columns = ['Variable','Importance'])
feature_imp.Variable = features[:-1]
feature_imp.Importance = importances*100
feature_imp.sort_values(by="Importance",axis=0,ascending=False)
Out[201]:
Variable Importance
9 StoreType_d 29.727510
1 Customers 25.878794
7 StoreType_b 15.998565
2 Promo 9.566363
6 Promo2 3.636089
5 CompetitionDistance 3.247081
10 Assortment_c 2.500697
8 StoreType_c 2.245669
11 PromoInterval_Feb,May,Aug,Nov 2.070978
12 PromoInterval_Mar,Jun,Sept,Dec 2.029378
0 DayOfWeek 1.014748
14 CompetitionDuration 0.882195
3 StateHoliday 0.550587
13 Day 0.359729
4 SchoolHoliday 0.291624
time: 17.4 s (started: 2024-09-08 09:07:11 +00:00)

We have considered XGBoost our final, optimal model with very good accuracy, but it is still a black-box model, since we do not know exactly what is happening inside the algorithm. In order to gain the trust of the stakeholders, we have to explain under which conditions the model predicts a particular result, with a valid and sensible reason. So, to increase explainability, we have plotted a bar plot of the features in decreasing order of importance.

From the above table and plot it is clear that for the XGBoost model "StoreType_d" contributes the most to the final outcome, about 30%, "Customers" contributes about 26%, followed by "StoreType_b", "Promo" and "Promo2".

We achieved about 93% accuracy with linear regression, but to attain more accuracy, and eventually scale up the business, we moved to decision tree, random forest and XGBoost. These are black-box models (hard to explain directly), so we use the model-explainability tool SHAP.

Explanability using SHAP¶

SHAP (SHapley Additive exPlanations) is used to calculate the impact of each feature of the model on the final result.

Here we are using TreeExplainer (for the analysis of tree-based models).

Explaining decision tree with ForcePlot

In [203]:
# Storing our features into new variable
feature = final_df.columns[:-1]
time: 172 ms (started: 2024-09-08 09:08:18 +00:00)
In [204]:
#checking the values in the list
feature
Out[204]:
Index(['DayOfWeek', 'Customers', 'Promo', 'StateHoliday', 'SchoolHoliday',
       'CompetitionDistance', 'Promo2', 'StoreType_b', 'StoreType_c',
       'StoreType_d', 'Assortment_c', 'PromoInterval_Feb,May,Aug,Nov',
       'PromoInterval_Mar,Jun,Sept,Dec', 'Day', 'CompetitionDuration'],
      dtype='object')
time: 547 ms (started: 2024-09-08 09:08:22 +00:00)
In [205]:
# Checking the first observation
x_test[0:1]
Out[205]:
array([[ 1.43855312, -0.50441843, -0.89789395, -0.03284707, -0.48994393,
         0.98225648, -0.99734346, -0.13702054,  2.54443201, -0.66476152,
         1.07611818, -0.36236   , -0.32497293,  0.01892054,  2.18203551]])
time: 656 ms (started: 2024-09-08 09:08:29 +00:00)
In [206]:
import shap 
# Create an explainer for our tree-based model (the XGBoost regressor xgbR)
explainer = shap.TreeExplainer(xgbR) 
shap_values = explainer.shap_values(x_test[0:1])  #pass the first test sample
time: 1min 9s (started: 2024-09-08 09:09:56 +00:00)
In [207]:
shap.initjs() #initialize the JavaScript visualization in the notebook environment 
shap.force_plot(explainer.expected_value, shap_values=shap_values[0], features = feature)
Out[207]:
[SHAP force plot for the first test sample]
time: 3.23 s (started: 2024-09-08 09:11:15 +00:00)
In [208]:
y_test_rand_xgbR_pred[0:1]
Out[208]:
array([80.117805], dtype=float32)
time: 31 ms (started: 2024-09-08 09:12:19 +00:00)
In [209]:
y_test[0:1]
Out[209]:
Sales
660791 82.413591
time: 47 ms (started: 2024-09-08 09:12:26 +00:00)
In [210]:
x_test[0:1]
Out[210]:
array([[ 1.43855312, -0.50441843, -0.89789395, -0.03284707, -0.48994393,
         0.98225648, -0.99734346, -0.13702054,  2.54443201, -0.66476152,
         1.07611818, -0.36236   , -0.32497293,  0.01892054,  2.18203551]])
time: 0 ns (started: 2024-09-08 09:12:36 +00:00)

Now, the first feature in the index is DayOfWeek, so we will explain our model's prediction for this sample.

  1. We can see that the base value is 80.22 while the model predicts 91.07 for this sample, which is quite good.
  2. Customers and StoreType_d are mainly responsible for that rightward shift, while Promo and DayOfWeek act from the opposite direction to partially balance it out.
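The force plot rests on SHAP's additivity property: the prediction equals the base value plus the sum of the per-feature contributions. A minimal sketch, using illustrative contribution numbers (not the actual SHAP values from this notebook) chosen to reproduce the 80.22 → 91.07 shift described above:

```python
import numpy as np

# Base value (explainer.expected_value) as reported by the force plot
base_value = 80.22

# Hypothetical per-feature contributions for one sample; the real values
# come from explainer.shap_values(x_test[0:1]) in the cells above.
contribs = np.array([6.1, 4.2, -2.5, 3.05])

# Additivity: prediction = base value + sum of all contributions
prediction = base_value + contribs.sum()
print(round(float(prediction), 2))  # → 91.07
```

In the real notebook the same check reads `explainer.expected_value + shap_values[0].sum()`, which should match `xgbR.predict(x_test[0:1])` up to floating-point error.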
In [211]:
#pass the second test sample
shap_values = explainer.shap_values(x_test[1:2])
time: 47 ms (started: 2024-09-08 09:12:43 +00:00)
In [212]:
#initialize the JavaScript visualization in the notebook environment 
shap.initjs() 
shap.force_plot(explainer.expected_value, shap_values=shap_values[0], features = feature)
Out[212]:
[SHAP force plot for the second test sample: interactive JavaScript output not rendered in this static export]
time: 16 ms (started: 2024-09-08 09:12:46 +00:00)

Now we take the second sample. The most influential feature here is Customers, and the force plot shows how far the prediction moves away from the average prediction (the base value): the base value is 80.22 and the predicted value is around 73.15.

In [213]:
#pass the third test sample
shap_values = explainer.shap_values(x_test[2:3])
time: 62 ms (started: 2024-09-08 09:13:26 +00:00)
In [214]:
#initialize the JavaScript visualization in the notebook environment 
shap.initjs()
shap.force_plot(explainer.expected_value, shap_values=shap_values[0], features = feature)
Out[214]:
[SHAP force plot for the third test sample: interactive JavaScript output not rendered in this static export]
time: 16 ms (started: 2024-09-08 09:13:28 +00:00)

Understanding the project importance for the stakeholders involved¶

To compete in a rapidly changing business landscape, retail stores are under increasing pressure to anticipate market trends and customer behavior. This is exactly the demand our analysis addresses. Our project involves various stakeholders such as the Rossmann stores, their customers, inventory managers, and society at large.

The insights from our project highlight how, by incorporating an appropriate machine learning model (in our case XGBoost) into its data analytics, the Rossmann chain gains far more accurate and powerful demand-forecasting capabilities, which translates into more effective inventory management and significant cost savings.

One of the most important segments for any retail business is customer retention. The Rossmann stores can rely on our machine learning models to identify customers who are likely to shop at their stores repeatedly.

Another important angle in our project is customer sentiment: features such as Promo2 and PromoInterval, and our analysis around them, can be used to discover how people respond to the brand and its promotions.

By allowing the Rossmann stores to process and analyze data more quickly than ever before, our machine learning analysis enables rapid and more accurate decision making. By enabling fast decisions about inventory, it can help the Rossmann stores uphold their business reputation and avoid costly corrective measures.

8. Future Work (Optional)¶

To store our final model (object) so that we can reuse it in the future, we will use the pickle module. The pickle module implements binary protocols for serializing and de-serializing a Python object structure. “Pickling” is the process where a Python object hierarchy is converted into a byte stream, and “unpickling” is the inverse operation, where a byte stream (from a binary file or bytes-like object) is converted back into an object hierarchy. Pickling (and unpickling) is alternatively known as “serialization”, “marshalling”, or “flattening”; however, to avoid confusion, the terms used here are “pickling” and “unpickling”.

Pickle in Python is primarily used in serializing and deserializing a Python object structure. In other words, it's the process of converting a Python object into a byte stream to store it in a file/database, maintain program state across sessions, or transport data over the network
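The round trip described above can be sketched in a few lines, using a plain dict as a stand-in for the trained model object (the notebook pickles `xgb_Rand_R` the same way):

```python
import pickle

# A plain dict standing in for the trained model object (illustrative only)
model_stub = {"name": "xgboost", "n_estimators": 100}

# Pickling: serialize the object hierarchy into a byte stream (wb = write bytes)
with open("model_stub.pkl", "wb") as f:
    pickle.dump(model_stub, f)

# Unpickling: reconstruct the object from the byte stream (rb = read bytes)
with open("model_stub.pkl", "rb") as f:
    restored = pickle.load(f)

assert restored == model_stub  # the object survives the round trip
```

For scikit-learn-style estimators with large numpy arrays inside, `joblib.dump`/`joblib.load` is a common alternative, as hinted at in the next heading.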

1. Save the best performing ml model in a pickle file or joblib file format for deployment process.¶

In [215]:
# Importing pickle module
import pickle
time: 0 ns (started: 2024-09-08 09:13:55 +00:00)
In [216]:
# Save the File
filename='Rossmann_regression.pkl'

# serialize process (wb=write byte)
pickle.dump(xgb_Rand_R,open(filename,'wb'))
time: 750 ms (started: 2024-09-08 09:13:58 +00:00)

2. Again Load the saved model file and try to predict unseen data for a sanity check.¶

In [217]:
# Load the File and predict unseen data.

# unserialize process (rb=read byte)
Regression_model= pickle.load(open(filename,'rb'))

# Predicting the unseen data(test set)
Regression_model.predict(x_test)
Out[217]:
array([ 80.117805, 104.76687 , 107.7133  , ...,  49.700115,  60.44205 ,
        59.113503], dtype=float32)
time: 2.75 s (started: 2024-09-08 09:14:03 +00:00)
In [218]:
# Checking if we are getting the same predicted values
y_test_rand_xgbR_pred
Out[218]:
array([ 80.117805, 104.76687 , 107.7133  , ...,  49.700115,  60.44205 ,
        59.113503], dtype=float32)
time: 0 ns (started: 2024-09-08 09:14:12 +00:00)

Great, we have saved our model as a pickle file, and it can be reused in the future after deserialization.

Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!¶

Conclusion¶

Exploratory Data Analysis (EDA) is an important step because it allows for an initial investigation of a dataset. It helps to identify patterns, anomalies, and relationships in the data, and to detect potential issues such as missing values or outliers. EDA also helps to generate hypotheses and informs the choice of more advanced modeling techniques, such as machine learning. Above all, it builds the deep understanding of the data that is crucial for good decision making and guides the direction of further analysis. After performing EDA we have drawn the following conclusions:

  • Sales vs Customers graph shows positive correlation between 'Sales' and 'Customers'. As the number of customers increases, the sales also tend to increase.
  • December, being a festive month, attracts more sales than the rest of the months. November also has slightly more sales than other months, which could be due to the 'Black Friday' sale that is very popular across the globe. As Rossmann stores deal in health and beauty products, it is plausible that the November and December spikes come from people in a celebratory mood buying beauty/health products.
  • Day 1 and day 7 witness the highest sales, indicating they probably fall on the weekend. Days 2 to 6 generate medium to low sales, indicating they are probably weekdays where customer footfall is lower.
  • Stores running a promotion generate more sales.
  • The plot between StateHoliday and sales shows that sales during the Easter and Christmas holidays are actually high, whereas for other holidays, such as public holidays, sales are comparatively low.
  • Sales for store type b are the highest. Store type b might be located in more affluent or high-traffic areas, which would increase the number of potential customers, or it may have a more favorable layout that makes it more attractive to customers and easier for them to find the products they want, resulting in more sales.
  • Sales are highest for assortment b. This assortment may have a good mix of products that are in high demand or that are unique to the store, which would result in more sales.
  • We observed that mostly the competitor stores weren't that far from each other and the stores densely located near each other saw more sales.
  • Sales are high for stores whose competition opened around the year 1900: back then very few Rossmann stores were operating, so there was less competition and sales were high. As the years passed, the number of stores increased, meaning competition also increased, leading to a decline in sales.
  • The plot of sales against CompetitionOpenSinceMonth shows sales increasing from November and peaking in December. This may be due to Christmas Eve and New Year.
  • We can conclude from the sales vs Promo2 graph that customers are slightly less responsive to stores running consecutive promotions. One possibility is that customers had already taken advantage of a similar promotion earlier. Another is that the store might not have invested enough in advertising the promotion, resulting in lower awareness and fewer sales. Also, if a store runs the same promotion again and again, it could see lower customer footfall, ultimately leading to fewer sales.
  • The sales vs Promo2SinceYear barplot shows that sales were still highest when the store wasn't running any consecutive promotional events. But in 2014 sales really shot up and are recorded as the 2nd highest; good-quality products, better deals, or the shutdown of competitors could be the reasons.
  • The promo interval Jan, Apr, Jul, Oct records the 2nd highest sales, as it marks the festive season. However, the other intervals record sales close to those of the 1st interval.
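The first insight above, the positive Sales–Customers relationship, can be checked numerically with a Pearson correlation coefficient. A minimal sketch on toy daily figures (illustrative numbers, not the Rossmann data):

```python
import numpy as np

# Toy daily figures standing in for the store data (illustrative only)
customers = np.array([480, 520, 610, 700, 850, 910])
sales = np.array([4200, 4500, 5300, 6100, 7400, 7900])

# Pearson correlation coefficient between footfall and revenue
r = np.corrcoef(customers, sales)[0, 1]
print(round(r, 3))  # close to +1: more customers, more sales
```

In the notebook itself, the equivalent check would be `df['Sales'].corr(df['Customers'])` on the merged DataFrame.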

Conclusions drawn from ML Model Implementation¶

Close predictions from an ML model strongly impact business growth. Before proceeding to deployment, one should check how accurately the model predicts and performs on real-life data.

The conclusions drawn from a model are very helpful for identifying whether it is fully baked and good to go for deployment, or needs further refinement. In this section we will first cover some general points that are essential for every ML model, and then the project-specific conclusions we made:

General conclusions:

  • The implementation of an ML model can greatly improve the performance and accuracy of a system or application.
  • It is important to carefully select and preprocess the data used for training and testing the model.
  • Regular evaluation and tuning of the model is necessary to ensure optimal performance.
  • The use of appropriate evaluation metrics can help to measure the performance of the model.
  • The integration of the model into the system or application should be done in a way that allows for easy maintenance and updates.
  • The ethical and legal considerations of the model's use should also be taken into account.

Project conclusions:

We have implemented various regression models, starting with Linear Regression and then moving on to non-linear models. For each model we also tuned the hyperparameters in order to minimize the errors, and drew the following conclusions:

  • With Linear Regression we got an accuracy of ~93%, and the model captures 75% of the variance even after using regularization techniques, which means our data is not perfectly linearly related to the target variable (Sales).
  • For the Decision Tree we achieved ~96.3% accuracy with a maximum depth of 18 and a MAPE of 3.6%; increasing the depth beyond that pushes the model towards overfitting, which ultimately increases the mean absolute percentage error.
  • Letting every variable contribute generally results in better accuracy, as small feature subsets can still provide significant predictive power. The ensemble technique, Random Forest, gave an accuracy of ~96.96% with 100 trees in the forest after hyperparameter tuning.
  • We also tried a gradient boosting technique with LightGBM; although we got results similar to Random Forest (~96.4%), we got them much faster, as LightGBM used all the cores and decreased processing time. When training on a large dataset, one should try LightGBM for good results in less time.
  • At last we implemented our final model, XGBoost, and achieved an accuracy of 97% with a mean absolute percentage error of only 2%. We also got the mean and median of the residuals at 0.09, which indicates an excellent residual plot.
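The MAPE figures quoted above follow the standard definition: the mean of |actual − predicted| / actual, expressed in percent. A minimal sketch with illustrative values (the first pair echoing the notebook's first test sample):

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

# Illustrative actual/predicted pairs (not the full test set)
y_true = [82.41, 100.0, 50.0]
y_pred = [80.12, 104.77, 49.70]
print(round(mape(y_true, y_pred), 2))  # → 2.72
```

Note that MAPE is undefined when any actual value is zero, which is why closed-store days (Sales = 0) are usually dropped before evaluation.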

A good residual plot has the following characteristics:

  1. It should contain high density of points close to the origin and a low density of points away from the origin.
  2. It is symmetric about the origin.

The residual plot obtained from XGBoost fulfils both characteristics, and based on the above experiments and findings we have chosen XGBoost as our final optimal model for deployment.
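Both characteristics can also be checked numerically rather than only visually. A minimal sketch on synthetic residuals (standing in for y_test − y_pred from the notebook):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic residuals standing in for y_test - y_pred (illustrative only)
residuals = rng.normal(loc=0.0, scale=1.0, size=10_000)

# Characteristic 1: high density of points close to the origin
frac_near_zero = np.mean(np.abs(residuals) < 1.0)

# Characteristic 2: roughly symmetric about the origin,
# so mean and median should both sit near zero
print(round(float(np.mean(residuals)), 3), round(float(np.median(residuals)), 3))
print(round(float(frac_near_zero), 2))  # around 0.68 for well-behaved residuals
```

Applying the same two checks to the actual XGBoost residuals would make the "mean and median at 0.09" observation above reproducible without reading the plot by eye.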